Preface


Deep learning has entered a period of designing, implementing, and deploying 
intensive and diverse applications, which are now visible in numerous areas. 
Successful case studies became a consequence of the prudent, carefully drafted
fundamental concepts that transformed the ways in which such real-world problems are
perceived, formalized, and solved in an increasingly machine-centered and
automatic fashion. Central to all pursuits are algorithms that help realize the
principles of deep learning in an efficient way. Algorithms deliver a sound alignment
with the specificity of the applied nature of the practical problem by addressing the 
computing requirements and selecting/adjusting overall algorithmic settings. 

The volume is composed of 11 chapters and reflects the wealth of algorithms of 
deep learning and their application studies in the plethora of areas including 
imaging, seismic tomography, power series forecasting, smart grids, surveillance, 
security, health care, environmental engineering, and marine sciences. 

We would like to express our thanks to the Series Editor, Prof. Janusz Kacprzyk. 
He has always been enthusiastic and highly supportive of this project. We are 
indebted to the professionals at Springer; the team has made the overall production 
process highly efficient and completed in a timely manner. 


Edmonton, Canada Witold Pedrycz 
Taipei, Taiwan Shyi-Ming Chen 
Abstract Activation functions lie at the core of deep neural networks, allowing them
to learn arbitrarily complex mappings. Without any activation, a neural network
will only be able to learn a linear relation between input and the desired output.
The chapter introduces the reader to why activation functions are useful and their 
immense importance in making deep learning successful. A detailed survey of several 
existing activation functions is provided in this chapter covering their functional 
forms, original motivations, merits as well as demerits. The chapter also discusses 
the domain of learnable activation functions and proposes a novel activation ‘SLAF’ 
whose shape is learned during the training of a neural network. A working model for 
SLAF is provided and its performance is experimentally shown on XOR and MNIST 
classification tasks. 


Keywords Activation functions - Neural networks - Learning deep neural 
networks - Adaptive activation functions - ReLU - SLAF 


1 Introduction 


Neural Networks (NNs) are powerful information processing tools and can learn data
representation [1] implicitly and efficiently. As a result, these networks have been
shown to give excellent performance on tasks like statistical pattern recognition,
classification, time series prediction, etc. [2, 3]. NNs are functionally inspired by
the working of the human brain. Just as the brain contains billions of heavily
interconnected neurons, NNs contain interconnected nodes that provide paths for
information to flow through. Similar to human beings, NNs can also learn from examples
and can make predictions/decisions based on the observed trends. Analogous to the firing
of only specific neurons in the brain, only some of the nodes in any NN's hidden layer are
activated in response to an input stimulus. The term for this firing of a neuron is the
action potential [4]. It plays an important role in cell-to-cell communication by
assisting the propagation of signals along the neuron's axon. A very simplistic model is
shown in Fig. 1 (derived from [5]). A neuron receives signals ($x_0$) from other
neurons' axon terminals, which are collected through dendrites. These signals undergo
multiplicative interaction in the dendrites ($w_0 x_0$) based on the synaptic strength ($w_0$).
Note that the weights can be excitatory when they are positive as well as inhibitory when
they are negative. If the final sum obtained ($\sum_i w_i x_i + b$) is greater than the threshold,
then the neuron fires. The frequency of firing is modeled by an activation function in
an NN. This model does not generalize to all kinds of neurons and makes certain
assumptions which might not be satisfied in actual neurons. Interested readers can refer to
[6] for more insights.

Fig. 1 Cartoon relating neuron and a neural network component

One of the earliest forms of neural networks is the perceptron. It was developed at
Cornell Aeronautical Laboratory by Frank Rosenblatt [7] for an image classification
task. The weights of the network were stored in physical potentiometers, and the
learning occurred with the help of electric motors which would update these weights
during the training phase of the neural network. Technically, the perceptron was
designed to learn a mapping from the input space, i.e., the pixels of an image, to a binary
space corresponding to the class that the image belongs to. To realize this mapping, a
threshold function with only two possible outcomes was required. Hence, the step
function was used at the output, which gave a unit value for positive inputs and zero for
negative inputs. Later, when stochastic gradient descent [8] became popular,
a differentiable and continuous approximation of the step function called sigmoid took its
place. It is a doubly saturating non-linear function whose output is bounded between
zero and one. This gave rise to logistic regression, which became the de facto standard for
classification tasks. Unlike the step function, which was a hard classifier, the output
of sigmoid can be interpreted as the probability of belonging to a particular class.

Fig. 2 (a) Linear decision boundary for XOR problem (b) Best decision boundary for XOR problem


Though simple and easy to use, logistic regression exhibits a major downside
when the classes are not linearly separable. To better understand linear separability
of classes, let's look at the XOR (exclusive OR) classification problem. It is
easy to see that it is not linearly separable (Fig. 2), and a linear classifier would
classify at most seventy-five percent of the points correctly (Fig. 2a). The
perfect decision boundary would have to be non-linear. Figure 2b shows one possible
boundary for the XOR problem, defined by two straight lines dividing the complete
feature space into three regions. The middle region belongs to the class labelled one,
and the other two belong to the class labelled zero. To increase the representation
capability of NNs such that they can learn non-linear boundaries, a hidden layer
is introduced between input and output, which readers have learned about in the earlier
parts of the book. This gives neural networks the ability to learn more complex and
highly non-linear functions, which in this chapter is informally
called an increase in the capacity of NNs. It is well known that increasing width
and depth increases the representation capacity of an NN in general. Hence, it becomes
important to ask the question, ‘does the choice of activation affect the relationship
between capacity and width or depth of a deep neural network (DNN)?’. The universal
function approximation theorem says that we can learn any function with a
wide enough architecture using any of the accepted activations. But does there exist
a better activation function, and if it does, how do we characterize it?
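
As a rough illustration of the XOR discussion above, the following Python sketch compares linear threshold classifiers with a small network that has one hidden layer. The weights and thresholds are hand-picked for illustration (they are assumptions, not values from the chapter and not learned by training), and the helper names `step`, `xor_net` and `best_linear_accuracy` are hypothetical.

```python
# XOR data: the four corner points and their labels.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]


def step(v):
    """Binary step unit: 1 for non-negative input, 0 otherwise."""
    return 1 if v >= 0 else 0


def linear_classifier(x1, x2, w1, w2, b):
    """A single linear decision boundary followed by a step unit."""
    return step(w1 * x1 + w2 * x2 + b)


def best_linear_accuracy():
    """Search a coarse, illustrative grid of linear classifiers."""
    grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    best = 0
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                correct = sum(
                    linear_classifier(x1, x2, w1, w2, b) == y
                    for (x1, x2), y in zip(points, labels)
                )
                best = max(best, correct)
    return best


def xor_net(x1, x2):
    """Two hidden step units define two parallel lines; the output unit
    fires only for points lying between them (hand-picked weights)."""
    h1 = step(x1 + x2 - 0.5)    # above the lower line
    h2 = step(x1 + x2 - 1.5)    # above the upper line
    return step(h1 - h2 - 0.5)  # between the two lines


print("best linear classifier gets", best_linear_accuracy(), "of 4 points right")
print("hidden-layer network outputs", [xor_net(*p) for p in points])
```

No linear classifier in the grid gets more than three of the four points right, whereas the two hidden units reproduce the kind of two-line boundary sketched in Fig. 2b.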

Another objective of this chapter is to allow readers not only to choose the correct
activation function, but also to understand the reasons why a certain activation
might perform better. It is well known that Long Short Term Memory networks (LSTMs),
variants of recurrent neural networks, use hyperbolic tan as the activation function and
the sigmoid activation function for the gating mechanism. Since sigmoid has significance in
terms of its interpretability as a probability, it is intuitive to use it for gating or binary
classification problems. In the later part of the chapter, we will see that sigmoid
suffers from the vanishing gradient problem. Hyperbolic tan, on the other hand, gives
convergence in fewer iterations due to its mathematical properties, which is why it
is used in LSTMs over sigmoid. Such properties are important to understand for
researchers developing new architectures and activations.


1.1 Drawbacks of Fixed Activation Functions


Tuning the width and depth of the network, applying regularization penalties
[9, 10] and using normalization methods [11] have been shown to help in improving
the performance of Deep Neural Networks (DNNs). Another important parameter
that affects the performance of NNs is the type of activation function. As we will see in
the later parts of the chapter, no two activations perform equivalently even with
the same architecture, so it can become critical to choose the right activation function.
Hence, it can be seen as a hyper parameter which can be tuned specifically for the
task and data set. One possible way is to try all existing activation functions on the
architecture and compare their performance. However, this becomes impractical when
we consider using a different activation function at each layer, or maybe even at each
node. Another problem that can be pointed out with fixed activation functions is
that they will inevitably introduce some non-linearity into the architecture. For
example, it is impossible to learn the identity operation exactly using such networks. Keeping
the activations fixed, it is impossible to recover the input without loss of information.
This can be seen as an important issue with auto encoders [12]. Even with high
dimensional encoding, the resulting output is always lossy. One might say that, had
the activation function been linear, this problem would not occur. These factors serve
as motivation for using and studying adaptive/learnable activation functions. In the
last section of the chapter, we will see the design of such activation functions and
understand their practical importance.


2 Existing Activation Functions 


In this section, we will try to understand various activation functions present in the
literature. We will study how they were proposed, why they are advantageous and how
newer activations were developed from them. Though we are far from fully understanding
what actually happens inside a neural network, mathematical analysis of these
non-linearities can provide a proper basis and intuition for what might be happening
inside neural networks. One must understand activation functions not just as a tool
to make NNs work but also in terms of their theoretical implications.


2.1 Linear Activation 


Neural networks equipped with linear activation functions are generally called
linear neural networks. They have been widely used to systematically study the
learning dynamics of deep learning [13, 14]. They have analytic importance and
are useful for understanding the nature of neural networks. Linear neural networks with
multiple hidden layers are equivalent to neural networks with a single hidden layer.

Fig. 3 (a) Linear activation function and its gradient (b) Binary step activation function and its gradient

They are used for linear regression tasks, and a closed form solution exists for such
networks. Linear activation functions are defined as


\[ y = x, \tag{1} \]


where, x is the input to activation function and y is its output. It is clear from Eq. (1) 
that the gradient of linear activation functions is unity. Figure 3a shows the activation 
function and its gradient pictorially. 
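
The collapse of stacked linear layers can be checked numerically. The sketch below is a minimal illustration under simplifying assumptions: biases are omitted, the weight values are arbitrary, and the helper names `matvec` and `matmul` are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Arbitrary illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[2.0, -1.0, 0.5]]
x = [0.5, -2.0]

# Forward pass through both layers with the linear activation y = x.
deep = matvec(W2, matvec(W1, x))

# The same mapping expressed as one linear layer with weights W2 * W1.
collapsed = matvec(matmul(W2, W1), x)

print(deep, collapsed)
```

Both computations give the same output, since composing linear maps yields another linear map; depth adds no representational power without a non-linearity.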


2.2 Binary Step Activation 


A binary step function is similar to a threshold function that can be used for
binary classification tasks [15, 16]. Although this function is quite old,
it still has historical importance in machine learning. It is used in classification tasks
done with signal processing methods. It is mathematically defined (for threshold at 0)
as:

\[ \text{B.Step}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \]


Its gradient can be written as:


\[ \frac{d(\text{B.Step}(x))}{dx} = \begin{cases} \text{undefined} & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \]


This function is non-differentiable at zero (or, more generally, at the threshold) and has zero
gradient at all other points. Hence, it cannot be trained with stochastic gradient
descent. This led to the adoption of a smoother, differentiable approximation known as
sigmoid, through which gradients could flow backwards. Let's see what sigmoid and
its multi-class variant look like.


2.3 Sigmoid and Softmax 


Sigmoid was originally designed for binary classification but now has wide applica- 
tion in tasks related to attention models, and bounded output regression as well [17, 
18]. Mathematically, sigmoid is defined as: 


\[ \text{sigmoid}(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \]


The output of sigmoid lies in the range (0, 1) and hence, it can be interpreted as a
probability. Its gradient is interestingly easy to compute as it can be expressed in
terms of the original activation itself:


\[ \frac{d(\sigma(x))}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x) \cdot (1 - \sigma(x)) \]

The major problem with sigmoid, which we will discuss below, is the vanishing
gradient problem. The gradient of the sigmoid activation function, as shown in Fig. 4,
vanishes away from zero and is upper bounded by 0.25. This characteristic makes
the training of deep neural networks slower when all its activations are sigmoid.
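
A minimal Python sketch of the sigmoid and its gradient is given here for illustration. The function names are hypothetical, and the branch on the sign of the input is an implementation choice made only to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), evaluated in a stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Gradient written in terms of the activation itself."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  gradient={sigmoid_grad(x):.6f}")
```

The gradient peaks at x = 0 with the value 0.25 and decays towards zero on both sides, which is the saturation behavior described above.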
Fig. 4 Sigmoid activation function and its corresponding gradient

Before we discuss the performance details of sigmoid, one might ask why anyone
would specifically use only this form of sigmoid. There are multiple
ways to approximate the binary step function, so why this one? To answer that, we can
look at the argument proposed in [19].

Consider a generative model for generating data denoted by the random variable X,
where x (one realization of X) can belong to either class $C_1$ or $C_2$. This model is
expressed by the prior probabilities $P(C_1)$, $P(C_2)$ and the class conditional
probability density functions $p(x|C_1)$, $p(x|C_2)$. Now, the posterior probability
of an input belonging to class $C_1$ can be written with the help
of Bayes' theorem as:


\[
\begin{aligned}
P(C_1|x) &= \frac{p(x|C_1) \cdot P(C_1)}{p(x|C_1) \cdot P(C_1) + p(x|C_2) \cdot P(C_2)} \\
&= \frac{1}{1 + \exp(-a)} \\
&= \sigma(a)
\end{aligned}
\tag{2}
\]

where $a = \ln\left(\frac{p(x|C_1) \cdot P(C_1)}{p(x|C_2) \cdot P(C_2)}\right)$, and is frequently referred to as ‘logits’.
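
The equivalence between the Bayes posterior and the sigmoid of the log-odds can be verified with a few lines of Python. The likelihood and prior values below are made-up numbers chosen only for illustration, and the function names are hypothetical.

```python
import math


def posterior_c1(lik1, prior1, lik2, prior2):
    """Posterior P(C1 | x) computed directly from Bayes' theorem."""
    return lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)


def log_odds(lik1, prior1, lik2, prior2):
    """The quantity a = ln(p(x|C1) P(C1) / (p(x|C2) P(C2)))."""
    return math.log((lik1 * prior1) / (lik2 * prior2))


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Illustrative values: p(x|C1)=0.3, P(C1)=0.6, p(x|C2)=0.1, P(C2)=0.4
args = (0.3, 0.6, 0.1, 0.4)
direct = posterior_c1(*args)
via_sigmoid = sigmoid(log_odds(*args))
print(direct, via_sigmoid)
```

Both routes give the same posterior (about 0.818 for these numbers), which is why the sigmoid output can be read as a class probability.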


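The above idea is extended to write the posterior probability for multiple classes
(let's say K classes $(C_1, C_2, \ldots, C_K)$). This gives rise to the softmax activation
function [19]:

\[
\begin{aligned}
P(C_k|x) &= \frac{p(x|C_k) \cdot P(C_k)}{\sum_{j=1}^{K} p(x|C_j) \cdot P(C_j)} \\
&= \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} \\
&= \text{softmax}(z_k)
\end{aligned}
\tag{3}
\]


where z is a K-dimensional vector containing K logits, one for each class, with
$z_k = \ln(p(x|C_k) \cdot P(C_k))$. This wraps
up the two most important activation functions, sigmoid and softmax. The gradient
of softmax can also be written in terms of the original function itself, with $s(z_i)$
denoting the ith softmax output:

\[ \frac{\partial(\text{softmax}(z_i))}{\partial z_j} = \frac{\partial(s(z_i))}{\partial z_j} = \begin{cases} s(z_i) \cdot (1 - s(z_i)) & \text{if } i = j \\ -s(z_i) \cdot s(z_j) & \text{if } i \neq j \end{cases} \tag{4} \]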
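
The softmax and its gradient can be sketched in plain Python as follows. This is an illustrative implementation, not code from the chapter: subtracting the maximum logit before exponentiating is a common numerical-stability trick that leaves the result unchanged, and the function names are hypothetical.

```python
import math


def softmax(z):
    """Softmax over a list of logits, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """Matrix of partial derivatives d softmax(z)_i / d z_j."""
    s = softmax(z)
    n = len(s)
    return [
        [s[i] * (1.0 - s[i]) if i == j else -s[i] * s[j] for j in range(n)]
        for i in range(n)
    ]


logits = [2.0, 1.0, -1.0]
probs = softmax(logits)
jac = softmax_jacobian(logits)
print("probabilities:", probs, "sum =", sum(probs))
print("row sums of the Jacobian:", [sum(row) for row in jac])
```

The outputs sum to one, and every row of the Jacobian sums to (numerically) zero, because raising all logits by the same amount does not change the softmax output.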


Given the important meaning associated with sigmoid activation functions, their
use inside hidden layers was initially a heuristic (inspired biologically). Later, it
was found that sigmoid led to vanishing gradient problems. For cascaded sigmoid
activations across m hidden layers of one node each (calling the outermost layer
$h_1$, its activated output $o_1$ and the incoming weight of the edge $w_1$, such that
$h_k = w_k \cdot o_{k+1}$ and $o_k = \sigma(h_k)$), the corresponding gradient at the kth layer is:


\[ \frac{\partial o_1}{\partial o_k} = \frac{\partial o_1}{\partial o_2} \cdot \frac{\partial o_2}{\partial o_3} \cdots \frac{\partial o_{k-1}}{\partial o_k} \tag{5} \]

Now, each factor $\frac{\partial o_i}{\partial o_{i+1}} = w_i \cdot \sigma'(h_i)$ contains the derivative of sigmoid, which is upper bounded by a value of 0.25.
These successive multiplications lead to a smaller and smaller value in shallower layers.
This makes the learning slower for the shallower parameters. This disadvantage
of sigmoid activation functions can be overcome by using the hyperbolic tan (tanh) activation
function. It has interesting mathematical properties even though its characteristics are
close to those of the sigmoid activation function.
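
To make the shrinking product concrete, the sketch below multiplies the per-layer factors for a chain of one-node sigmoid layers. It assumes unit weights and zero pre-activations, which is the most favorable case for the sigmoid gradient; the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(pre_activations, weights):
    """Product of the factors w_i * sigmoid'(h_i) along a chain of
    one-node layers, i.e. the gradient of the output with respect to
    the activation that feeds the deepest factor."""
    grad = 1.0
    for h, w in zip(pre_activations, weights):
        s = sigmoid(h)
        grad *= w * s * (1.0 - s)
    return grad


# Best case for sigmoid: every pre-activation is 0, so each factor is 0.25.
for depth in (1, 5, 10, 20):
    g = chain_gradient([0.0] * depth, [1.0] * depth)
    print(f"depth={depth:2d}  gradient={g:.3e}")
```

Even in this best case the gradient decays as 0.25 raised to the depth, so ten layers already reduce it to below one millionth.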


2.4 Tanh (Hyperbolic Tan) 


Hyperbolic tan is another activation function, which can also be called a symmetric
sigmoid [20]. It is a zero centered, doubly saturating activation function (saturating
away from zero in both directions). It is mathematically defined as:


\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{6} \]

and its gradient can be written as

\[ \frac{d(\tanh(x))}{dx} = \frac{4}{(e^{x} + e^{-x})^2} \tag{7} \]


The above equations show that the gradient of tanh is upper bounded by one, which
is four times the bound for the sigmoid activation function. Therefore, due to larger gradients,
symmetric sigmoids are often seen to converge faster (Fig. 5).
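
The comparison between the two gradients can be sketched in Python as below. The function names are hypothetical, and the inputs are kept moderate because this direct form of the tanh gradient would overflow in `math.exp` for very large magnitudes.

```python
import math


def tanh_grad(x):
    """Gradient of tanh written as 4 / (e^x + e^-x)^2."""
    return 4.0 / (math.exp(x) + math.exp(-x)) ** 2


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (0.0, 0.5, 1.0, 2.0):
    print(
        f"x={x:3.1f}  tanh'={tanh_grad(x):.4f}  "
        f"1 - tanh^2={1.0 - math.tanh(x) ** 2:.4f}  sigmoid'={sigmoid_grad(x):.4f}"
    )
```

At x = 0 the tanh gradient equals 1 while the sigmoid gradient equals 0.25, and the printed values also confirm that the expression above agrees with the familiar form 1 - tanh(x)^2.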
Fig. 5 Tanh activation function and its corresponding gradient

Another property which contributes to the superior performance of tanh compared
to the sigmoid activation function is its zero centered nature. This means that if
we uniformly sample points from $\mathbb{R}$ and then apply the tanh function to them, the output
mean would theoretically be zero. The argument from [21] given below will help us
to understand how this leads to faster training of neural networks.


Proposition 1 Consider a non-linearity or activation function, denoted as f(.), being
used in a neural network trained with stochastic gradient descent. If the activation
function is not zero centered, i.e., $\sum_{x} f(x) \neq 0$, then the activation function will
result in biased gradients, and hence the network will train slowly and reach the optimum in
more iterations.


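Proof Consider a multi-layered neural network with N hidden layers, with a non-linear
vector transformation defined as F(.). This transformation takes in a vector and
applies an activation function f(.) to each of its components. The relation between the
(n - 1)th layer (with k inputs) of the neural network and the nth layer (with r inputs) can be
written as:

\[ Y_n = W_n X_{n-1} \tag{8} \]

\[ X_n = F(Y_n) \tag{9} \]

where $W_n$ is the weight matrix of shape (r x k), and $X_n$ is the nth hidden layer after
applying the activation F(.) on $Y_n$. The derivatives can be calculated from the following
recurrence relations.

\[ \frac{\partial E}{\partial y_n^{i}} = f'(y_n^{i}) \frac{\partial E}{\partial x_n^{i}} \tag{10} \]

\[ \frac{\partial E}{\partial x_{n-1}^{j}} = \sum_{i=1}^{r} w_n^{ij} \frac{\partial E}{\partial y_n^{i}} \tag{11} \]

\[ \frac{\partial E}{\partial w_n^{ij}} = x_{n-1}^{j} \frac{\partial E}{\partial y_n^{i}} \tag{12} \]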


Here, $x^{i}$, $y^{i}$, $w^{ij}$ represent the components of the vectors X, Y, and the matrix W respectively,
and E is the objective function. In the case where f(.) is the sigmoid activation function,
$X_{n-1}$ will always be positive. From Eq. (12), it can be inferred that
$\frac{\partial E}{\partial w^{ij}} > 0 \;\; \forall \, i \in \{1, \ldots, r\}, \, j \in \{1, \ldots, k\}$ if the partial derivative
$\frac{\partial E}{\partial y^{i}} > 0 \;\; \forall \, i \in \{1, \ldots, r\}$, and likewise all of them are negative if every
$\frac{\partial E}{\partial y^{i}}$ is negative.

Therefore, the gradient for the matrix, $\frac{\partial E}{\partial w^{ij}}$,
will only point in the first quadrant or the third quadrant in the case where r = 1 and k = 2.
To change the direction of this row vector, the path followed by stochastic gradient
descent has to zig-zag. More generally, this effect is seen whenever the activation
function is not zero centered. Generally, activation functions are biased to either positive
or negative values. This bias is eventually translated to the gradients, resulting in
a longer convergence time. Hence, it is preferable to use zero centered activation
functions. This nature can be explicitly provided by using normalization
techniques, which make activations zero centered at each layer where normalization
is applied. □
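
The sign argument in the proof can be illustrated for a single output unit (r = 1) with two inputs (k = 2). The sketch below uses the standard backpropagation rule that the gradient of a weight is the incoming activation times the error signal at the unit; the input values, the error signal and the function names are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weight_gradient(x_prev, dE_dy):
    """Gradient for the weights of one unit: dE/dw_j = x_j * dE/dy."""
    return [xj * dE_dy for xj in x_prev]


def same_sign(values):
    """True when all components are positive or all are negative."""
    return all(v > 0 for v in values) or all(v < 0 for v in values)


pre_activations = [-2.0, 1.5]  # arbitrary pre-activations of the previous layer
sigmoid_inputs = [sigmoid(v) for v in pre_activations]   # always positive
tanh_inputs = [math.tanh(v) for v in pre_activations]    # zero centered

for dE_dy in (0.3, -0.3):
    print("sigmoid inputs:", weight_gradient(sigmoid_inputs, dE_dy))
    print("tanh inputs:   ", weight_gradient(tanh_inputs, dE_dy))
```

With sigmoid inputs both gradient components always share the sign of the error signal (first or third quadrant), whereas with tanh inputs the components can take opposite signs, so the update is free to point in any quadrant.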


Hyperbolic tan is a zero centered activation function and has major application in
recurrent neural networks, where the problem of vanishing gradient is much more
prominent. One should take note that if the gradient of an activation function is more
than one, repeated multiplication can result in a phenomenon termed exploding gradients [22].
This leads to instability during training, and the network may never achieve steady state
convergence. Techniques such as gradient clipping [22] ensure that the gradient is appropriately
scaled to lower values in the event of gradient explosion.
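
One common form of gradient clipping rescales the gradient vector whenever its Euclidean norm exceeds a threshold. The sketch below is a minimal version of that idea; the threshold value and the function name `clip_by_norm` are illustrative assumptions, and other clipping schemes (for example clipping each component separately) also exist.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm.
    The direction of the gradient is preserved."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


exploding = [3.0, 4.0]   # norm 5, larger than the threshold
well_behaved = [0.3, 0.4]  # norm 0.5, left untouched

print(clip_by_norm(exploding, 1.0))
print(clip_by_norm(well_behaved, 1.0))
```

The large gradient is scaled down to unit norm while keeping its direction, and the small gradient passes through unchanged.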


2.5 The ReLU Family 


ReLU is one of the most popular and widely used activation functions. Many versions
of the ReLU activation function have been proposed. Below we discuss the activation
functions belonging to this family.


2.5.1 ReLU 


ReLU was proposed in [23, 24], and was shown to stabilize the feedback in analogue
circuits. The authors proved that under certain conditions, networks equipped with the ReLU
non-linearity would always converge to a steady state. Mathematically, ReLU is
defined as


\[ \text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \tag{13} \]

Its gradient is given as

\[ \frac{d(\text{ReLU}(x))}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases} \tag{14} \]

ReLU is a continuous function and is differentiable at all points except zero.
It might seem that ReLU will not be compatible with stochastic gradient descent.
However, there is only an infinitesimally small probability of landing on an exactly
zero input to this activation function, given random initialization of architectures
as well as the presence of a bias term before applying the activation. In practice, ReLU is
the most widely adopted activation function. There are several reasons that give
an advantage to the ReLU activation (Fig. 6).

Fig. 6 ReLU activation function and its corresponding gradient


1. Gradient stability: From Eq. (5), it is easy to see that no matter how deep we
go, the gradient of ReLU (when the input is positive) is always one, and hence it
will not vanish. A more generic formulation can be written using Eqs. (10), (11),
(12). Hence, the gradient of ReLU dies only when all of
the ReLU outputs are zero across the hidden layer.

2. Computationally cheaper: Other activation functions require the evaluation of
exponentials and division operations. ReLU is simpler
because it only requires a max operation to generate its output. The same argument
is valid for the gradient calculation. This is another reason why ReLU is preferable
over other, more complex activation functions.

3. Sparsity: ReLUs result in sparsity in layers, since if the output of a unit becomes zero, its
connection becomes irrelevant to the model. This allows for analyzing which features
or variables are important. However, whether it actually improves generalization
or performance is still an open question.


In contrast to the above advantages of ReLU, there exist some drawbacks as well.
ReLU can result in dead neurons: if the output of a particular activation becomes
zero, then its gradient might die forever. Since the gradient flowing backward would
also be zero, it is possible that a neuron which could have contributed greatly
might never recover from that state. Another problem with ReLU is its non-zero
centered nature, because of which the activations are biased towards being positive. Had
ReLU been zero centered, it would likely have accelerated the training.
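
The dead-neuron behavior can be reproduced with a single ReLU unit. In the sketch below, the gradient at exactly zero is set to 0 by convention (the true derivative is undefined there), and the weights, inputs, upstream gradient and learning rate are all made-up values; the function names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    """Gradient of ReLU; the value at x = 0 is a convention."""
    return 1.0 if x > 0 else 0.0


def sgd_step(w, b, xs, upstream, lr):
    """One gradient step for a neuron relu(w * x + b) over inputs xs,
    where upstream is the gradient arriving from the layers above."""
    grad_w = sum(upstream * relu_grad(w * x + b) * x for x in xs)
    grad_b = sum(upstream * relu_grad(w * x + b) for x in xs)
    return w - lr * grad_w, b - lr * grad_b


xs = [0.2, 1.0, 3.0]  # all inputs are positive

# Dead neuron: the pre-activation is negative for every input.
dead = sgd_step(-1.0, -0.5, xs, upstream=1.0, lr=0.1)

# Live neuron: the pre-activation is positive, so the weights move.
live = sgd_step(1.0, 0.5, xs, upstream=1.0, lr=0.1)

print("dead neuron after update:", dead)
print("live neuron after update:", live)
```

The dead neuron receives a zero gradient and its parameters do not change, so it stays inactive for these inputs regardless of how many updates are applied.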
Fig. 7 (a) LReLU and its corresponding gradient. (b) PReLU with a = 0.3 and its corresponding
gradient


2.5.2 Leaky ReLU 


An activation function called leaky ReLU (LReLU) [25] is used to avoid the dying problem in
ReLU. The major disadvantage of ReLU is that it saturates at zero whenever it is
not activated. This leads to a zero gradient and to slower training. To solve this
issue, a leak is added to the activation instead of a hard zero. Mathematically, LReLU
is defined as:

\[ \text{LReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{if } x \leq 0 \end{cases} \tag{15} \]


This leak helps to increase the range of the ReLU. The gradient of the leaky ReLU
is obtained as

\[ \frac{d(\text{LReLU}(x))}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ 0.01 & \text{if } x < 0 \end{cases} \tag{16} \]


The characteristics of Leaky ReLU and its gradient are shown in Fig. 7a.
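
A minimal Python sketch of leaky ReLU and its gradient follows. The slope of 0.01 is the value used in the text; exposing it as a `slope` argument and assigning the leak slope at exactly zero are choices made for this illustration.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positive inputs, a small slope otherwise."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Gradient of leaky ReLU (using the leak slope at x = 0 by convention)."""
    return 1.0 if x > 0 else slope


for x in (-5.0, -1.0, 0.5, 3.0):
    print(f"x={x:5.1f}  LReLU={leaky_relu(x):7.3f}  gradient={leaky_relu_grad(x):.2f}")
```

Unlike ReLU, the gradient for negative inputs is small but never exactly zero, so a unit that is currently inactive can still receive updates.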


2.5.3 PReLU (Parametric ReLU) 


Parametric rectified linear unit (PReLU) is a generalization of ReLU introduced
in [26]. It was originally defined for convolutional neural networks (CNNs). The
PReLU is:

\[ \text{PReLU}(x_i) = \begin{cases} x_i & \text{if } x_i > 0 \\ a_i x_i & \text{if } x_i \leq 0 \end{cases} \tag{17} \]


where $a_i$ is a parameter that is learned during training. Here, the subscript i refers to
the ith channel in the convolutional neural network. For each feature map, $a_i$ is shared,
thereby reducing the chance of over-fitting. LReLU uses $a_i = 0.01$, but PReLU
adaptively learns the slope of the negative part. There is a trade-off in computational
cost when using PReLU with the motivation of learning a better and more specialized activation.
The gradient of the PReLU is

\[ \frac{d\,\text{PReLU}(x_i)}{dx_i} = \begin{cases} 1 & \text{if } x_i > 0 \\ a_i & \text{if } x_i \leq 0 \end{cases} \tag{18} \]

The characteristics of PReLU and its gradient are shown in Fig. 7b. The update
formulas for the parameters $a_i$ are derived from the chain rule. To write the update rules for
the parameters $a_i$, consider the equation below,

\[ \frac{\partial E}{\partial a_i} = \sum_{x_i} \frac{\partial E}{\partial \text{PReLU}(x_i)} \frac{\partial \text{PReLU}(x_i)}{\partial a_i} \tag{19} \]

where E represents the objective function. The term $\frac{\partial E}{\partial \text{PReLU}(x_i)}$ is the gradient back
propagated from the deeper layers. The gradient of the activation with respect to $a_i$ is given by

\[ \frac{\partial \text{PReLU}(x_i)}{\partial a_i} = \begin{cases} 0 & \text{if } x_i > 0 \\ x_i & \text{if } x_i \leq 0 \end{cases} \tag{20} \]

The gradient of a, when it is shared across channels as well as feature maps, is

\[ \frac{\partial E}{\partial a} = \sum_{i} \sum_{x_i} \frac{\partial E}{\partial \text{PReLU}(x_i)} \frac{\partial \text{PReLU}(x_i)}{\partial a} \tag{21} \]


The authors of the PReLU activation function suggest using a momentum optimizer for
updating the parameters $a_i$. Hence, the update at the nth iteration can be given as

\[ \Delta a_i^{n} = \mu \Delta a_i^{n-1} + \epsilon \frac{\partial E}{\partial a_i} \tag{22} \]

\[ a_i^{n} = a_i^{n-1} - \Delta a_i^{n} \tag{23} \]


where $\mu$ represents the momentum and $\epsilon$ the learning rate. Since there is no constraint on
the sign of $a_i$, PReLU does not necessarily need to be monotonic.
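
The update of the learnable slope can be sketched for a single channel as follows. This is an illustrative reading of the momentum update described above, not the reference implementation of [26]; the inputs, upstream gradients, momentum, learning rate and starting slope are made-up values, and the function names are hypothetical.

```python
def prelu(x, a):
    """PReLU for one channel with learnable negative slope a."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Partial derivative of PReLU(x) with respect to the slope a."""
    return 0.0 if x > 0 else x


def momentum_update(a, velocity, xs, upstream, mu=0.9, lr=0.01):
    """One momentum step for the slope a of a single channel.
    xs are the pre-activations in the channel, upstream the gradients
    of the objective with respect to the PReLU outputs."""
    grad_a = sum(g * prelu_grad_a(x) for x, g in zip(xs, upstream))
    velocity = mu * velocity + lr * grad_a   # accumulate the update
    return a - velocity, velocity            # apply it to the slope


xs = [-1.0, 2.0, -3.0]
upstream = [1.0, 1.0, 1.0]
a_new, v_new = momentum_update(a=0.25, velocity=0.0, xs=xs, upstream=upstream)
print("updated slope:", a_new, "velocity:", v_new)
```

Only the negative pre-activations contribute to the gradient of the slope, so with these numbers the gradient is -4, the velocity becomes -0.04 and the slope moves from 0.25 to 0.29.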


2.5.4 ELU (Exponential Linear Unit) 


ELU was proposed by Clevert et al. [27] as an improvement to the ReLU activation
function and its variants (LReLU and PReLU). The ELU activation function is defined
as

\[ \text{elu}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \leq 0 \end{cases} \tag{24} \]


where $\alpha$ is a hyper parameter that controls the value for negative inputs. The gradient
of ELU is given by


\[ \frac{d\,\text{elu}(x)}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ \text{elu}(x) + \alpha & \text{if } x \leq 0 \end{cases} \tag{25} \]

Fig. 8 (a) The exponential linear unit (ELU) with α = 1 and its corresponding gradient. (b) The scaled
exponential linear unit (SELU) with α = 1.6733 and λ = 1.0507, with its corresponding gradient


ELU also solves the problem of vanishing gradient, similar to ReLU, by having unit
gradient for all positive inputs. In addition, its output and gradient are non-zero for negative
inputs, with the negative outputs pushing the mean of the activation closer to zero. As discussed in Proposition
1, this leads to faster training of the neural network. The major improvement of ELU
over LReLU and PReLU comes from its saturating behavior for negative inputs.
This brings relatively less variation in the activation, which in turn makes it
robust to noise. Moreover, the ELU activation has shown improved results over the ReLU
activation on both supervised and unsupervised machine learning tasks. Figure 8a
shows the ELU activation function and its gradient for hyper parameter α = 1.
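
A minimal Python sketch of ELU and its gradient is shown below, with α = 1 as in Fig. 8a. The function names are hypothetical.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Gradient of ELU, written via the activation for negative inputs."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


for x in (-50.0, -3.0, -1.0, 0.5, 2.0):
    print(f"x={x:6.1f}  elu={elu(x):8.5f}  gradient={elu_grad(x):.5f}")
```

For large negative inputs the output saturates at -α and the gradient approaches zero, while positive inputs pass through with unit gradient.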


2.5.5 SELU (Scaled Exponential Linear Unit) 


It is well known that, even though FNNs (Fully Connected Neural Networks) are
a highly sophisticated machine learning algorithm, they fail to live up to their
reputation in real life. However, with the support of CNNs and RNNs, NNs can achieve
state of the art results. This can be attributed to CNNs and RNNs having parameter
sharing across feature maps and cells respectively, combined with normalization
techniques like batch normalization and layer normalization. Since image and
time series data have structure across space and time respectively, these parameters
are efficiently shared, which substantially reduces the complexity. In [28], the authors find
that the reason behind the lacking performance of FNNs is their high variance across
different training examples and their sensitivity to perturbations. This brings us to the
concept of SNNs. SNNs (Self Normalized Neural Networks) keep the activations
normalized when propagating them through the layers of the network. SNNs need
two things to work: a custom weight initialization and SELUs as the activation
function. The weight initialization allows the mean and variance at each layer to be
zero and one respectively.

SNNs cannot be derived with ReLUs, sigmoid, tanh and leaky ReLU. SELUs are
obtained by multiplying the exponential linear unit by a parameter λ, which is kept
greater than 1 to ensure a slope greater than one. The authors of SELU provide four
conditions which any activation function should ideally possess (and which SELUs do
satisfy):


1. Both positive and negative values should be present in the range of the activation
function so that the mean is zero.

2. Saturating regime in the activation function to dampen or reduce the variance of
the output of activation.

3. A regime with slope greater than one so as to increase the variance if needed by
the network.

4. The activation function should be continuous.


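SELU (Scaled Exponential Linear Unit) is mathematically defined as


\[ \text{selu}(x) = \begin{cases} \lambda x & \text{if } x > 0 \\ \lambda\alpha(e^{x} - 1) & \text{if } x \leq 0 \end{cases} \tag{26} \]


where $\alpha$ and $\lambda$ are two fixed parameters derived from the input. For standard scaled
inputs, the authors provide the optimal value of α as 1.6733 and λ as 1.0507. The gradient
of SELU can be written as


\[ \frac{d\,\text{selu}(x)}{dx} = \begin{cases} \lambda & \text{if } x > 0 \\ \lambda\alpha + \text{selu}(x) & \text{if } x \leq 0 \end{cases} \tag{27} \]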


The characteristics of SELU are shown in Fig. 8b. SELU has a self normalizing property
that allows networks with many layers to be trained with a high learning rate. This
activation function is designed to avoid the exploding and vanishing gradient problems.
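
The self normalizing behavior can be probed numerically. The sketch below implements SELU with the rounded constants quoted above and estimates, by simple Monte Carlo sampling, the mean and variance of the output when the input is standard normal. The sample size, seed and function names are illustrative assumptions.

```python
import math
import random

ALPHA = 1.6733   # rounded constants quoted in the text
LAMBDA = 1.0507


def selu(x):
    """Scaled exponential linear unit."""
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)


def selu_grad(x):
    """Gradient of SELU, written via the activation for negative inputs."""
    return LAMBDA if x > 0 else selu(x) + LAMBDA * ALPHA


def selu_output_moments(n_samples=100_000, seed=0):
    """Monte Carlo estimate of the mean and variance of selu(z)
    for z drawn from a standard normal distribution."""
    rng = random.Random(seed)
    ys = [selu(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, var


mean, var = selu_output_moments()
print(f"mean of selu(z): {mean:.4f}, variance of selu(z): {var:.4f}")
```

With these constants, a standard normal input is mapped to an output whose mean stays close to zero and whose variance stays close to one, which is the fixed point that the self normalizing argument relies on.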


2.6 Softplus 


Softplus is another activation proposed in [24]. It is a smooth approximation of ReLU,
which is max(0, x). It is mathematically defined as:


\[ \text{Softplus}(x) = \log(1 + e^{x}) \tag{28} \]

The gradient of softplus can be written as a sigmoid unit:


\[ \frac{d(\text{Softplus}(x))}{dx} = \frac{1}{1 + e^{-x}} \tag{29} \]


The softplus activation function is computationally more complex but provides several
advantages over other activations. Since the derivative of a softplus unit is a sigmoid,
which approaches one for large positive inputs, it
does not suffer from the vanishing gradient issue there. The gradient does not die in almost
half of the whole real domain. Unlike ReLU, the gradient does not become zero
instantly when the input is negative. So, the neural network does not have many dead
neurons. As mentioned for ReLU, the effect of sparsity in
neural networks is hard to quantify. In the next subsection, we will discuss maxout networks, which have
no sparsity in activation and yet are reported to perform better than the ReLU activation (Fig. 9).

Fig. 9 Softplus activation function and its corresponding gradient
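
Softplus and its gradient can be sketched in Python as follows. The rewritten form max(x, 0) + log(1 + e^(-|x|)) is algebraically equal to log(1 + e^x) and is used here only to avoid overflow for large inputs; the function names are hypothetical.

```python
import math


def softplus(x):
    """Softplus log(1 + e^x), evaluated in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus_grad(x):
    """The derivative of softplus is the sigmoid."""
    return sigmoid(x)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    relu = max(0.0, x)
    print(f"x={x:5.1f}  softplus={softplus(x):.4f}  relu={relu:.1f}  gradient={softplus_grad(x):.4f}")
```

Softplus tracks ReLU closely away from zero, but its gradient for negative inputs decays smoothly rather than dropping to zero at once.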


2.7 Maxout 


Maxout was proposed by Goodfellow [29], as a completely different class of acti- 
vation function. We first see the mathematical description of maxout and then its 
specific benefits. We consider a hidden layer with t neurons, where each neuron is 


denoted as h; (i € {1, ..., t}). The neuron h; after applying maxout can be written as: 
hj = max Zz (30) 
Jél,...k 


Li x Woy + bij (31) 
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Fig. 10 Image Courtesy: [29]. How maxout can approximate arbitrary uni-variate functions. Sim- 
ilarly, multivariate functions can also be approximated by maxout 


where the input $x \in \mathbb{R}^{d}$, $W$ is the weight tensor of shape $(d \times t \times k)$
and $b$ is a matrix of shape $(t \times k)$. In the case of convolutional neural networks, the
maxout unit is implemented across $k$ different feature maps. This amounts to a max-pooling
operation across the spatial domain and maxout across channels. Reference [29] mentions that,
similar to a multilayer perceptron network, a maxout network is also a universal function
approximator. They show that, given enough affine components $k$ per unit, two maxout units
are sufficient to approximate any continuous function arbitrarily well on a compact domain.
They use the fact that any continuous function can be approximated (with error $\epsilon$)
by a piecewise linear function. It is important to note that maxout learns a piecewise linear
function, which serves as the activation function of the neural network, as shown in Fig. 10.
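
Equations (30) and (31) can be sketched in a few lines of NumPy. The snippet below is a minimal illustration for a single input vector, with hand-picked weights (an assumption made purely for the example) so that one unit reproduces ReLU and another reproduces the absolute value, both with $k = 2$ affine pieces.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer: x has shape (d,), W has shape (d, t, k), b has shape (t, k)."""
    z = np.einsum("d,dtk->tk", x, W) + b   # z[i, j] = x^T W[:, i, j] + b[i, j]
    return z.max(axis=1)                   # h[i] = max_j z[i, j]

# d = 1 input, t = 2 units, k = 2 affine pieces per unit (illustrative weights)
W = np.zeros((1, 2, 2))
b = np.zeros((2, 2))
W[0, 0] = [1.0, 0.0]    # unit 0: max(x, 0)  -> ReLU
W[0, 1] = [1.0, -1.0]   # unit 1: max(x, -x) -> |x|

xs = np.array([-2.0, -0.5, 0.0, 1.5])
outputs = np.stack([maxout(np.array([v]), W, b) for v in xs])
print(outputs)   # column 0 follows ReLU, column 1 follows |x|
```

With learned rather than fixed weights, each unit is free to settle on whatever convex piecewise linear shape the data favors.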

Maxout networks are also advantageous because of their compatibility with the dropout
regularization technique. Dropout randomly nullifies neurons of a network. It is implemented
by masking the hidden layer with a tensor of zeros and ones of the same dimensions. Dropout
results in an ensemble of multiple weaker models, providing a regularization effect. The
output of a maxout unit, when trained with dropout, changes relatively rarely across
different dropout masks. Maxout networks effectively enlarge their linear regions, so that
even if some units are dropped, the input to maxout still falls in the same linear region.
This invariance is easily achieved with maxout networks, making maxout an excellent
activation function. One disadvantage of maxout networks is the need for larger architectures
to implement the maxout operation: the extra dimension $k$ increases the number of parameters
of the model. However, their performance is remarkable when compared with ReLUs, considering
such a non-trivial form of non-linearity.


2.8 Swish Activation Function 


So far, we have seen various activations that have mostly been hand-designed to improve
certain characteristics of the ReLU activation function. In [30], the authors present a
method to discover novel activation functions using an automated, reinforcement learning
based search. The core idea is to design a search space consisting of composite combinations
of existing activation functions and then test each candidate on a standard data set to
compare them empirically. The crux of the method lies in designing these composite
combinations. An exhaustive search, i.e., trying all combinations of activation functions,
would result in a very large search space, making the search practically infeasible. Hence,
in [30], the authors use an RNN controller that predicts the components of an activation
function one at a time, feeding each prediction back to predict the remaining components of
the same new activation function.

Fig. 11 a The Swish activation function. b First derivative of Swish

Once a candidate activation function has been generated, it is tested on the CIFAR-10 data
set (described in Sect. 3) using a child network. A list of top-performing functions is
maintained to keep track of the best activations. This method resulted in various novel
activation functions which outperformed ReLU on CIFAR-10, at least when using the child
network. It was found that the activation function $f(x) = x \cdot \sigma(\beta x)$
outperformed ReLU on both CIFAR-10 and CIFAR-100 on various deep architectures. The function
$f(x) = x \cdot \sigma(\beta x)$ is called the Swish activation function. Here, $\beta$ is a
constant or trainable parameter and $\sigma(z) = (1 + \exp(-z))^{-1}$. Figure 11 shows the
Swish activation function and its first derivative for various values of $\beta$. For
$\beta = 0$, $f(x) = x/2$, so Swish behaves like a scaled identity function. As
$\beta \to \infty$, $f(x) \to \max(0, x)$, i.e., $f(x)$ acts as the ReLU activation function.
This suggests that the Swish activation function can be loosely viewed as a smooth version of
the ReLU activation function.
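
The two limiting cases of Swish can be checked numerically. The following is a minimal NumPy sketch; the value $\beta = 50$ is an arbitrary stand-in for "large $\beta$", and the grid of inputs is chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    # f(x) = x * sigma(beta * x)
    return x * sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 13)

half_linear = swish(x, beta=0.0)        # beta = 0: sigma(0) = 1/2, so f(x) = x / 2
near_relu = swish(x, beta=50.0)         # large beta: f(x) approaches max(0, x)
gap_to_relu = np.max(np.abs(near_relu - np.maximum(0.0, x)))

print("beta = 0 equals x/2:", np.allclose(half_linear, x / 2.0))
print("max |swish_50 - relu| on the grid:", gap_to_relu)
# Swish with beta = 1 is non-monotonic: it dips below zero and comes back up
print(swish(-5.0), swish(-1.0), swish(0.0))
```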


3 Comparison of Activation Functions 


After understanding various existing activation functions, it is important to see how they
perform with different architectures. To compare them, we present the results of these
activation functions on the CIFAR-10 and CIFAR-100 data sets using 3 different
state-of-the-art architectures. CIFAR-10 and CIFAR-100 each contain 60,000 color images
belonging to 10 and 100 classes, respectively. 50,000 of these images form the training set
and the remaining 10,000 are the test images. The task is to accurately classify the test
images based on the training data set. The 3 architectures used are ResNet-164 [31], Wide
ResNet 28-10 (WRN) [32] and DenseNet 100-12 [33]. The results shown are taken from [30].

Tables 1 and 2 show the test accuracy of several non-linear activation functions on three
different architectures. It is evident that no single activation works best on every
architecture. For example, on the CIFAR-10 data set, Softplus outperforms all other
activation functions when ResNet is used as the underlying architecture but performs poorly
on Wide ResNet compared to other activations. This creates a dilemma around how to select the
optimal activation function for a given architecture. Hence, we want an activation function
which can adapt itself depending on the data set and architecture. In the next section, we
focus on such activation functions and discuss two different approaches that can be used to
learn activation functions.


Table 1 CIFAR-10 test accuracy 


Table 2 CIFAR-100 test accuracy 




4 Learning Activation Functions 


So far, we have discussed various activation functions, from the identity function to Swish.
In this section, we shift our focus from fixed activation functions to those which can be
learned. Learning an activation function means that the shape of the function is not fixed
but is parameterized by learnable weights, which are learned during training of the NN. Owing
to their capability of adaptation, they are also known as ‘Adaptive Activation Functions
(AAFs)’. References [34–37] show different approaches to designing AAFs. Below, we discuss
two separate approaches to learning activation functions. The first method uses an ‘Adaptive
Piecewise Linear Activation Function’ [37], which is learned independently for each neuron
using gradient descent. Next, we propose a technique which aims to learn an activation
function using techniques of non-linear approximation. We call it ‘Self Learnable Activation
Function (SLAF)’.


4.1 Learning Adaptive Piecewise Linear (APL) Activation 
Functions 


As the name suggests, this method learns a piecewise linear activation function for each
neuron in the neural network. It formulates an activation function $h(x)$ as

$$h(x) = \max(0, x) + \sum_{s=1}^{S} a^{s} \max(0, -x + b^{s}) \qquad (32)$$

where $S$ is a hyperparameter and the variables $a^{s}$ and $b^{s}$ for
$s \in \{1, \ldots, S\}$ are learned using the same algorithm as the other network weights.
Note that the method aims to learn the best piecewise linear function for a given data set
and architecture. Hence, it matters which continuous piecewise linear functions Eq. (32) can
represent.


Theorem 1 Any continuous piecewise linear function $g(x)$ can be expressed by
Eq. (32) for some $S$ if it satisfies the following two conditions:


1. There exists a scalar $u$ such that $g(x) = x$ for all $x > u$
2. There exist two scalars $\alpha$ and $v$ such that $\nabla_x g(x) = \alpha$ for all $x < v$
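
The APL unit of Eq. (32) and the two conditions of Theorem 1 can be illustrated with a short NumPy sketch. The coefficient values `a` and `b` below are arbitrary illustrative numbers (they are not taken from [37]); in practice they would be learned per neuron.

```python
import numpy as np

def apl(x, a, b):
    """APL unit: max(0, x) + sum_s a[s] * max(0, -x + b[s])."""
    x = np.asarray(x, dtype=float)
    hinges = np.maximum(0.0, -x[..., None] + b)   # shape (..., S)
    return np.maximum(0.0, x) + hinges @ a

a = np.array([0.5, -0.2])   # illustrative values only
b = np.array([1.0, -1.0])

# Condition 1: to the right of every hinge location, h(x) = x
right = np.array([3.0, 4.0])
right_vals = apl(right, a, b)

# Condition 2: to the left of every hinge location, the slope is the constant -sum(a)
left = np.array([-5.0, -4.0])
left_slope = (apl(left[1], a, b) - apl(left[0], a, b)) / (left[1] - left[0])

print(right_vals, left_slope, -a.sum())
```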


We do not provide the proof of the above theorem; the reader may refer to [37] for further
details. Figure 12 shows the APL activation function when the summation in Eq. (32) has only
one term. It is interesting to note that all the curves except (b) show non-monotonic
behavior, in contrast to the fixed activation functions discussed so far (other than the
Swish activation function). Moreover, Fig. 12b shows non-convex behavior of the activation
function. This shows the freedom of APL activation functions to adapt themselves depending on
the task. The APL activation function also outperforms ReLU on various data sets. As
mentioned in [37], the best error rate on CIFAR-10 using the APL activation function is
7.51%, whereas ReLU had an error rate of 7.73%, using the Network in Network (NIN)
architecture [38]. Similarly, on the same architecture for the CIFAR-100 data set, APL
outperforms ReLU by around 2%, achieving an error rate of 30.83%.

Fig. 12 Adaptive piecewise linear activation function with different a, b parameters for S = 1

The above method enables neural networks to learn a diverse set of activation functions.
However, all these learned activation functions are piecewise linear, so the search space
explored by this method is limited to piecewise linear functions. Below, we discuss a more
general method of learning activation functions which does not make any inherent assumptions
about the nature of the activation function.


4.2 SLAF: Learning Non-linear Approximation of Activation 
Functions 


Every continuous function in a function space can be written as a linear combination
of basis functions of the same function space [39]. If the $\phi_i$ are the basis elements,
then the function $f(x)$ with input $x$ can be expressed as

$$f(x) = \sum_{i=0}^{\infty} a_i \phi_i(x) \qquad (33)$$


Here, the $a_i$ are the coefficients of the basis elements and are unique to $f(x)$. If we
fix the basis functions and learn their coefficients using a suitable algorithm, we can
effectively learn $f(x)$. The only problem is that the expansion contains infinitely many
elements, and it is practically impossible to learn all of them. Restricting the number of
elements in the basis results in an approximation of the function. A suitable approximation
of a function $f(x)$ with $N$ basis elements can be given by:

$$f(x) \approx a_0 \phi_0(x) + a_1 \phi_1(x) + \cdots + a_{N-1} \phi_{N-1}(x) = \hat{f}(x) \qquad (34)$$

where the $\phi_i$ are the basis elements and the $a_i$ are the corresponding coefficients.
If we take $f(x)$ to be the activation function which we aim to learn, learning
$\{a_0, \ldots, a_{N-1}\}$ would eventually learn $f(x)$. Since this method learns the
approximation $\hat{f}(x)$ and not the actual function $f(x)$, it becomes a prime concern to
find a basis which provides a good approximation with $N$ basis elements. Although there are
many choices for basis functions, we use the Even Mirror Fourier Non-linear (EMFN) filter
basis owing to its strong approximation capabilities.


4.2.1 Even Mirror Fourier Non-linear (EMFN) Filter 


EMFN filters [40] can be used to approximate any continuous function $f(x)$ on the interval
$[-1, 1]$. To obtain an expansion in the EMFN basis, $f(x)$ is extended to the entire real
axis $\mathbb{R}$ by taking its periodic even mirror repetition: the values of $f(x)$ on
$[-1, 1]$ are taken and repeated over the entire real line. The repetition is done so as to
satisfy the following two properties:

$$\begin{aligned} f(1 + x) &= f(1 - x) && \text{(even mirror of } f(x)\text{)} \\ f(x) &= f(x + 4) && \text{(periodic extension)} \end{aligned} \qquad (35)$$

The EMFN filters use sine and cosine functions as basis elements. Since $f(x)$ is periodic
with period 4, it is easy to write the Fourier series expansion of $f(x)$. The Fourier series
expansion contains the following basis elements:


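$$\left\{1, \cos\left(\tfrac{\pi}{2}x\right), \sin\left(\tfrac{\pi}{2}x\right), \cos(\pi x), \sin(\pi x), \cos\left(\tfrac{3\pi}{2}x\right), \sin\left(\tfrac{3\pi}{2}x\right), \cos(2\pi x), \sin(2\pi x), \cos\left(\tfrac{5\pi}{2}x\right), \ldots\right\} \qquad (36)$$

Now, since the basis elements
$\{\cos(\tfrac{\pi}{2}x), \sin(\pi x), \cos(\tfrac{3\pi}{2}x), \sin(2\pi x)\}$ do not satisfy
the even mirror property of $f(x)$, they can be removed from the basis. Hence, the resultant
basis and the corresponding function approximation can be given as:

$$\left\{1, \sin\left(\tfrac{\pi}{2}x\right), \cos(\pi x), \sin\left(\tfrac{3\pi}{2}x\right), \cos(2\pi x), \sin\left(\tfrac{5\pi}{2}x\right), \ldots\right\} \qquad (37)$$

$$\hat{f}(x) = a_0 + a_1 \sin\left(\tfrac{\pi}{2}x\right) + a_2 \cos(\pi x) + a_3 \sin\left(\tfrac{3\pi}{2}x\right) + a_4 \cos(2\pi x) + \cdots \qquad (38)$$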


It is also possible to extend the above basis to approximate functions in $N$ dimensions,
but we restrict our discussion to the one-dimensional case, since an activation function
acts on a scalar input.
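
To make the construction concrete, the sketch below builds the first few surviving basis elements (constant, then alternating sine and cosine terms), checks the even mirror and period-4 properties numerically, and fits a truncated expansion by least squares. The function name `emfn_basis`, the ordering of the elements, and the target $\tanh(2x)$ are illustrative assumptions; the chapter itself learns the coefficients by gradient descent rather than by least squares.

```python
import numpy as np

def emfn_basis(x, n_terms):
    """First n_terms basis elements: 1, sin(pi x/2), cos(pi x), sin(3 pi x/2), ..."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < n_terms:
        arg = k * np.pi * x / 2.0
        # odd k keeps the sine term, even k keeps the cosine term
        cols.append(np.sin(arg) if k % 2 == 1 else np.cos(arg))
        k += 1
    return np.stack(cols[:n_terms], axis=-1)

x = np.linspace(-1.0, 1.0, 201)
Phi = emfn_basis(x, 7)

# Even mirror about x = 1 and period 4 hold for every retained element
mirror_gap = np.max(np.abs(emfn_basis(1.0 + x, 7) - emfn_basis(1.0 - x, 7)))
period_gap = np.max(np.abs(emfn_basis(x, 7) - emfn_basis(x + 4.0, 7)))

# Least-squares fit of a stand-in target on [-1, 1]
target = np.tanh(2.0 * x)
coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
fit_error = np.max(np.abs(Phi @ coef - target))
print(mirror_gap, period_gap, fit_error)
```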


4.2.2 Model Setup 


We want to use the EMFN filter approximation method to learn activation functions which can
adapt themselves depending on the task and architecture. First, we replace the existing
pre-defined ‘Activation-Function’ block by a ‘Self-Learnable Activation-Function (SLAF)’
block. This block should eventually learn a good approximation of the optimal activation
function for any given network. Let the optimal activation function for the network be $F$.
An approximation of $F$ can be given by Eq. (38). With suitable coefficients, Eq. (38) can
theoretically represent $F$ with arbitrarily small error. Moreover, for a fixed input and a
fixed basis (the EMFN basis), the only variables in Eq. (38) are the coefficients. Hence, if
we learn these coefficients, we eventually learn the optimal activation function. So, we can
now narrow our goal of learning activation functions to simply finding a model where we can
plug the approximation equation in place of the activation function and learn the right set
of coefficients along with the other network weights.

Let $x$ be the input to the activation function and $f(x)$ denote its output. We know that
the domain of approximation of $f(x)$ using EMFN filters is $[-1, 1]$. In general, the input
to an activation function in a neural network can take any real value in the range
$(-\infty, \infty)$. Hence, before using the EMFN filter approximation, we first need to
restrict the input $x$ to the range $[-1, 1]$. This can be done by dividing the input tensor
$X$ ($X$ is the tensor representation of $x$, containing $B$ $x$'s, where $B$ is the batch
size) by its absolute maximum value (across the batch, for each feature). This bounds the
transformed tensor to $[-1, 1]$. We also divide this transformed tensor by a learnable
parameter $m$ to further narrow its range within $[-1, 1]$. This is done to avoid abrupt
behavior of EMFN filters around $x \in \{-1, 1\}$. This gives us the transformed tensor
$\tilde{X}$:

$$\tilde{X} = \frac{X}{m \cdot \max(|X|)} \qquad (39)$$


During the training phase, $\max(|X|)$ might be 0 or a very small positive quantity, which
would lead to division by 0 (or by a vanishingly small number) in Eq. (39). Hence, we add a
small learnable parameter $\epsilon$ to the denominator of Eq. (39). This gives us the final
transformed tensor $\tilde{X}$, which is defined as follows:

$$\tilde{X} = \frac{X}{m \cdot \max(|X|) + \epsilon} \qquad (40)$$

Fig. 13 SLAF model using EMFN filter


We keep both $m$ and $\epsilon$ as learnable parameters as they are both data set dependent.
Now, we can use this scaled input tensor to approximate our activation function. Using this,
we define our activation function as

$$f(x) = f_{\mathrm{star}}(\tilde{x}), \qquad f_{\mathrm{star}}(\tilde{x}) = W_0 \phi_0(\tilde{x}) + \cdots + W_{n-1} \phi_{n-1}(\tilde{x}) \qquad (41)$$

where the $W_i$ are learnable, the $\phi_i$ belong to the EMFN basis, and $\tilde{x}$ is one
element of $\tilde{X}$ (which has the same shape as $X$). Note that the $W_i$ are shared
across the complete tensor $\tilde{X}$ to avoid over-fitting. This is pictorially depicted in
Fig. 13.
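
The forward pass described by Eqs. (40) and (41) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `m` and `eps` are fixed constants here, whereas the chapter treats them as learnable; the basis ordering (constant, then alternating sine and cosine terms) and the names `emfn_basis` and `slaf_forward` are hypothetical.

```python
import numpy as np

def emfn_basis(x, n_terms):
    """Basis elements 1, sin(pi x/2), cos(pi x), sin(3 pi x/2), ... evaluated at x."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < n_terms:
        arg = k * np.pi * x / 2.0
        cols.append(np.sin(arg) if k % 2 == 1 else np.cos(arg))
        k += 1
    return np.stack(cols[:n_terms], axis=-1)

def slaf_forward(X, W, m=1.5, eps=1e-6):
    """X: (B, F) pre-activations; W: (n,) coefficients shared across the whole tensor."""
    # Eq. (40): scale each feature by its absolute maximum across the batch
    scale = m * np.max(np.abs(X), axis=0, keepdims=True) + eps
    X_tilde = X / scale
    # Eq. (41): weighted sum of basis elements, applied element-wise
    return emfn_basis(X_tilde, W.shape[0]) @ W, X_tilde

rng = np.random.default_rng(0)
X = 10.0 * rng.normal(size=(8, 3))
W = np.array([0.0, 1.0, 0.0])          # picks out sin(pi * x / 2) only
out, X_tilde = slaf_forward(X, W)
print(out.shape, np.max(np.abs(X_tilde)))
```

In a real network the coefficients `W` (and, following the text, `m` and `eps`) would be trained jointly with the other weights by backpropagation.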


4.2.3 Training Routines 


EMFN filters have strong expressive power, which means that the coefficients of the basis
elements learned by the model can heavily over-fit the training data set, resulting in poor
generalization. To avoid this, we need improved training routines. We propose the following
methods, which can be used along with SLAF to improve generalization.


1. Regularization: Regularization is a standard technique to reduce over-fitting in machine
learning. L2 regularization [9] is used on the coefficients being learned. Reference [41]
points out a problem with using L2 regularization together with the Adam optimizer [42].
Hence, if the Adam optimizer is used for optimizing the network weights, a separate
optimizer is used for the regularization loss. The experiments in the sections below use
an SGD optimizer for the regularization loss and the Adam optimizer for minimizing the
cross-entropy or mean squared loss, depending on the task.

2. Learning rate decay: Learning rate decay [43, 44] is very important for most tasks when
using the self-learnable activation function in a neural network. This helps in avoiding
local minima.

3. Tuning the number of basis elements: The number of basis elements changes the expressive
power of the network. A high number of basis elements can not only lead to over-fitting
but also raise convergence issues. The experiments in the subsection below use only 3 or
4 basis elements.


4.2.4 Experiments 


In this subsection, we present a series of experiments in which the results of a fixed
activation function and the self-learnable activation function are compared on different
tasks.


1. XOR: XOR is a logical operator which takes two binary inputs and produces a binary
output. Table 3 shows this classification (‘x’ and ‘o’ denote two separate classes). The
XOR operation is not linearly separable and therefore it is impossible to learn it without
a hidden layer, i.e., with a simple ‘Perceptron’. The architecture used for this
experiment first takes a weighted combination of the inputs and then applies an activation
function (acting as a non-linearity) on the output of this weighted combination. The main
reason for using this sort of architecture is to see whether the existing activation
functions have enough capacity to learn this decision boundary. Table 4 shows that the
maximum accuracy that can be achieved using ReLU is 75%, whereas SLAF can classify this
data set with 100% accuracy. This is because SLAF can adapt to the task depending on the
data set. The final decision boundaries learned by both activation functions are shown
in Fig. 14.

2. MNIST: The MNIST data set contains 70,000 images of 28 × 28 pixels, each containing a
handwritten digit from 0 to 9. The task is to classify these images into 10 classes
depending on the digit written in the image. We use a Convolutional Neural Network (CNN)
[45] consisting of 2 convolutional layers followed by 2 fully connected layers to train
our model. The architecture uses 3 activation functions, all of which we replace by SLAF.
We used L2 regularization on the SLAF weights and learning rate decay, and achieved an
accuracy of 99.46%. The maximum accuracy achieved using ReLU on the same architecture was
99.34%. Figure 15 shows that the test accuracy using SLAF is almost always better than
that of ReLU. Moreover, the accuracy curve of the neural network using SLAF fluctuates
negligibly compared to that of ReLU, suggesting the stronger generalization capability
of SLAF.


Table 3 XOR operator

        0    1
   0    o    x
   1    x    o


Table 4 Results on XOR problem. k_1, k_2 are the number of elements used for sine and cosine
terms respectively

   Activation function               Accuracy
   ReLU                              0.75
   EMFN filter (k_1 = 3, k_2 = 3)    1.0


Fig. 14 Comparison of decision boundaries learned with training epochs by using two different
activation functions. ‘o’ (dot) refers to the class labeled as zero and ‘x’ (cross) refers to
the class labeled as one. a Using ReLU b Using SLAF

Fig. 15 Test accuracy versus iterations on MNIST data set using different activation functions
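
The XOR result can be reproduced in spirit with a small, simplified sketch. It uses a single neuron (a weighted combination followed by an activation and a 0.5 threshold), a coarse grid search over the weights for ReLU, and a single non-monotonic EMFN basis element in place of a fully trained SLAF. The grid, the threshold and the hand-picked weights are assumptions made for illustration; the input scaling of Eq. (40) is omitted.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])   # XOR labels

def accuracy(activation, w, b, threshold=0.5):
    pred = (activation(X @ w + b) > threshold).astype(int)
    return np.mean(pred == y)

def relu(z):
    return np.maximum(0.0, z)

# ReLU is monotonic, so thresholding it still yields a half-plane decision rule
grid = np.linspace(-2.0, 2.0, 9)
best_relu = max(
    accuracy(relu, np.array([w1, w2]), b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# One non-monotonic basis element, sin(pi * z / 2), applied to z = x1 + x2
def emfn_element(z):
    return np.sin(np.pi * z / 2.0)

acc_sine = accuracy(emfn_element, np.array([1.0, 1.0]), 0.0)
print("best ReLU accuracy on the grid:", best_relu)
print("accuracy with sin(pi z / 2):", acc_sine)
```

The non-monotonic element maps the sums 0, 1, 2 to values near 0, 1, 0, which is exactly the XOR pattern that a monotonic activation cannot produce from a single weighted combination.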


5 Conclusion 


In this chapter, we first saw how neural networks were biologically motivated and then
examined the role of activation functions in neural networks. Activation functions are
non-linear mappings from one subspace to another, allowing neural networks to learn
arbitrarily complex functions, which forms the basis of deep learning. In the second part,
we gave a detailed description of the popular activation functions that have been proposed
so far, and saw how this area has been developing with rather little mathematical evidence.
The nature of these functions heavily affects the training dynamics, which is why
researchers have been focusing on designing better activations for the past decade. Concepts
such as normalization of activations and exploding and vanishing gradients have been key to
the synthesis of newer activations. We also learnt that different activation functions lead
to different performance of neural networks. In the later parts of the chapter, the paradigm
of learnable activation functions was introduced. We discussed the adaptive piecewise linear
activation and a novel self-learnable activation function. Both approaches focus on learning
the activation function while training the neural network, depending on the data set, the
task and the network. Since a high variation in performance is experienced when switching
between different activation functions, it is incorrect to claim that a fixed activation will
outperform every other activation on every task or data set. Hence, an algorithm is proposed
which can search for the best activation in the entire space of continuous functions. Readers
should note that the algorithm proposed in this chapter, i.e., the self-learnable activation
function, uses the even mirror Fourier non-linear filter basis in its current form, which may
not be the perfect way to learn the best activation. However, it points towards a promising
direction which might change our perception of neural networks. There are still many unknowns
which act as barriers to a complete understanding of neural networks. The activation function
remains the biggest black box, one that has been driven by isolated concepts or biological
analogies. For faster developments in this field, people look through the lens of better
empirical performance, but the approach for learning activations introduced in this work aims
at creating a mathematical basis behind these non-linearities.


6 Future Works


The chapter highlights the importance of learnable activation functions and discusses two
completely different methods to learn them. The performance of SLAF is shown on simple data
sets. To empirically validate the usefulness of SLAF, one must conduct experiments on more
complex data sets such as CIFAR-100 and ImageNet. The basic methodology used by EMFN filters
to learn a non-linear approximation of activation functions is discussed in Sect. 4.2.
Different basis functions can be used in place of EMFN filters to empirically find the most
suitable one. A proper model setup must be designed for every basis. More training routines,
such as applying dropout and different regularization techniques on the activation
coefficients, can be proposed to achieve faster optimization of the neural networks. The
chapter focuses on only one block of DNNs, viz. activation functions. One can always extend
the concept of learning to other non-linear components present in neural networks.


Abstract Deep learning architectures are vulnerable to adversarial perturbations, which are
added to the input and drastically alter the output of deep networks. The resulting instances
are called adversarial examples. They are observed in various learning tasks, from supervised
learning to unsupervised and reinforcement learning. In this chapter, we review some of the
most important highlights in the theory and practice of adversarial examples. The focus is on
designing adversarial attacks, theoretical investigation into the nature of adversarial
examples, and establishing defenses against adversarial attacks. A common thread in the
design of adversarial attacks is the perturbation analysis of learning algorithms. Many
existing algorithms rely implicitly on perturbation analysis for generating adversarial
examples. A summary of the most powerful attacks is presented in this light. We overview
various theories behind the existence of adversarial examples as well as theories that
consider the relation between the generalization error and adversarial robustness. Finally,
various defenses against adversarial examples are also discussed.


1 Introduction


Artificial intelligence is on the rise and Deep Neural Networks (DNNs) are an important part
of it. Whether in speech analysis [29] or visual tasks [26, 34, 58, 68], they shine with a
performance beyond what was imagined a decade ago. Their success is undeniable; nevertheless,
a flaw has been spotted in their performance: they are not stable under adversarial
perturbations [69]. Adversarial perturbations are intentionally designed worst-case noises
that aim at changing the output of a DNN to an incorrect one. The perturbations are most of
the time so small that an ordinary observer may not even notice them, and even
state-of-the-art DNNs are highly confident in their wrong classification of these adversarial
examples. This phenomenon is depicted in Fig. 1, borrowed from [24], where a subtle
adversarial perturbation is able to change the classification outcome. Robustness to
adversarial perturbations is different from robustness to random noise [19], a trait that can
be achieved by DNNs. The existence of adversarial perturbations was known for machine
learning algorithms [9]; however, they were first noticed in deep learning research in [69].
These discoveries generated interest among researchers to understand the instability of DNNs,
to explore various attacks and to devise multiple defenses. Although it is very difficult to
keep up with the pace of results in this area, there are many excellent surveys on the topic.
For instance, the surveys [1, 78] cover many interesting instances for which adversarial
examples exist. In this chapter, we also overview some of the most important findings
regarding adversarial examples for DNNs. However, we adopt a different approach. Instead of
creating a catalog of existing attacks and defenses, we present an adequately general
framework which can recover many existing attacks. Theoretical findings regarding the nature
of adversarial examples are additionally addressed. In this light, we address three problems
in this chapter, namely adversarial attacks, their theoretical explanation and adversarial
defenses.

The first question is about generating adversarial examples and designing attacks. This is
discussed in the first part of this chapter. Historically, these examples were first found
for classification tasks and were based on first-order approximations of DNNs. These methods
require knowledge of model parameters and are therefore sometimes called white-box attacks.
We overview some of the most important attacks, including iterative and non-iterative
methods, as well as single and multiple pixel attacks. Instead of listing different attacks,
our goal is to present a unifying framework for generating adversarial examples. The
framework, which goes beyond classification problems, is based on a convex optimization
formulation of adversarial input generation. We furthermore overview black-box attacks, where
only partial knowledge of model parameters is available for generating adversarial examples.
Universal adversarial perturbations and the transferability of adversarial examples are other
topics discussed in this part.

[Figure panels: “panda” + .007 × perturbation = “gibbon”]

Fig. 1 A demonstration from [24] of adversarial examples generated using the FGSM. By adding
an imperceptibly small vector, we can change GoogLeNet’s classification of the image

The second question is about the nature of adversarial examples. Why are DNNs and other
machine learning models vulnerable to adversarial examples? In the second part, we overview
some of the attempts to investigate this question theoretically. In many works, adversarial
vulnerability is attributed to certain properties of machine learning models. Some examples
are the linearity of models, the curvature of decision boundaries of classifiers and a low
$\ell_1$-norm of weight matrices. After reviewing some of these theories, we discuss
statistical learning theoretic approaches that explore the relation between adversarial
robustness and the generalization capabilities of machine learning models. Out of this study
come new guidelines for designing adversarially robust algorithms, which brings us to the
third question of this chapter: how can we design effective defenses against adversarial
examples?

The defenses take different approaches, from modifying the training process by changing the
training set, to adding new regularizations, to considering new DNN architectures, or a
combination of the preceding approaches. Some of the most recent contributions in this
direction are discussed in the last part.


1.1 Notation and Preliminaries 


We first introduce the notation used in this chapter and some of the basic definitions needed
throughout. The letters $x, y, \ldots$ are used for vectors, $A, B, \ldots$ for matrices and
$\mathcal{X}, \mathcal{Y}, \ldots$ for sets. We denote the set $\{1, \ldots, n\}$ by $[n]$ for
$n \in \mathbb{N}$. For any vector $x = (x_1, \ldots, x_n)^{T} \in \mathbb{R}^n$ and
$p \in \mathbb{N}$, the $\ell_p$-norm of $x$ is defined by

$$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}.$$

When $p$ tends to zero, $\|x\|_p^p$ converges to the number of non-zero entries of the
vector. This is called, with an abuse of terminology, the $\ell_0$-norm. The explicit
definition is given as

$$\|x\|_0 = \sum_{i=1}^{n} \mathbb{1}(x_i \neq 0).$$

The $\ell_0$-norm gives the sparsity order of the vector $x$. The $\ell_\infty$-norm of a
vector $x$ is obtained when $p \to \infty$. It is defined as

$$\|x\|_\infty = \max_{i \in [n]} |x_i|.$$

The norm of a matrix $X \in \mathbb{R}^{m \times n}$ is similarly defined. The Frobenius norm
of $X$ is denoted by $\|X\|_F$ and defined as:

$$\|X\|_F = \left(\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2\right)^{1/2}.$$

The Schatten $p$-norm of the matrix $X$ is equal to the $\ell_p$-norm of the singular value
vector $(\sigma_1, \ldots, \sigma_{\min\{m,n\}})$ of $X$, namely:

$$\|X\|_p = \left(\sum_{i=1}^{\min\{m,n\}} \sigma_i^p\right)^{1/p}.$$

The Frobenius norm of $X$ is equal to the $\ell_2$-norm of the singular value vector. The
$\ell_1$-norm of the singular value vector is called the nuclear norm. The $\ell_0$-norm is
similarly defined and gives the rank of $X$.
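
The relations between these norms can be verified numerically. The following is a minimal NumPy sketch on a small hand-picked vector and matrix (both chosen only for illustration); the helper `schatten` is a hypothetical name, and the rank threshold `1e-12` is an arbitrary numerical tolerance.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])
l1 = np.sum(np.abs(x))
l2 = np.sum(np.abs(x) ** 2) ** 0.5
l0 = np.count_nonzero(x)          # number of non-zero entries
linf = np.max(np.abs(x))

X = np.array([[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
sigma = np.linalg.svd(X, compute_uv=False)   # singular value vector of X

def schatten(sigma, p):
    # l_p-norm of the singular value vector
    return np.sum(sigma ** p) ** (1.0 / p)

frobenius = np.sqrt(np.sum(X ** 2))
nuclear = schatten(sigma, 1)
rank = np.count_nonzero(sigma > 1e-12)       # l_0 "norm" of the singular values

print(l1, l2, l0, linf)
print(frobenius, schatten(sigma, 2), nuclear, rank)
```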

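Consider a function $f : \mathbb{R}^n \to \mathbb{R}^m$ given by
$f(x) = (f_1(x), \ldots, f_m(x))$ for $m$ functions $f_i : \mathbb{R}^n \to \mathbb{R}$. The
Jacobian of $f$ at $x$ is denoted by $J_f(x)$ and defined as

$$J_f(x) = \left(\nabla f_1(x), \ldots, \nabla f_m(x)\right) = \left[\frac{\partial f_j}{\partial x_i}\right]_{i \in [n],\, j \in [m]}$$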

2 Adversarial Perturbation Design 


Adversarial attacks ubiquitously follow the same pattern. An adversarial attacker is assumed
to have access to the system input, for example the input of a DNN. The attacker applies
perturbations to the system inputs under an additional and important constraint: the
perturbations should be restricted in some sense. For image-based tasks, this means that an
ordinary observer should not be able to spot, at least immediately, a significant change in
the image and its label. More generally, this constraint makes it hard for the administrator
to detect the perturbations. Finally, and most importantly, the system performance, for
example its classification accuracy, should be severely degraded. The attacks in [24, 47, 57]
follow similar guidelines. Two categories of adversarial attacks can be envisaged: white-box
and black-box attacks. In white-box attacks, the architecture of the target algorithm is
known to the attacker, although there are attacks with only partial knowledge of the
architecture. In contrast stand black-box attacks, which require no information about the
target neural network; see for instance [62].

In the pioneering work of [69], the attack is based on finding adversarial perturbations that
maximize the prediction error at the output. The perturbations are approximated by minimizing
the $\ell_2$-norm of the perturbation. If the multi-class classifier mapping is defined by
$f : \mathbb{R}^n \to [K]$, Szegedy et al. [69] minimize the $\ell_2$-norm of the perturbation
$\eta$ such that the classifier output is changed to the target label $l \in [K]$, i.e.,
$f(x + \eta) = l$. The perturbed input $x + \eta$ is constrained to lie inside the box
$[0, 1]^n$. The adversarial example is obtained by adding the perturbation $\eta$ to the input
vector $x$. In the next attack, the FGSM in [24], the sign of the gradient of the cost
function is used for designing perturbations, which are scaled to have bounded
$\ell_\infty$-norm and are therefore almost undetectable. If the cost function used for
training is given by $c(x)$, the perturbation is given by
$\eta = \epsilon \, \mathrm{sign}(\nabla c(x))$. The $\ell_\infty$-norm of the perturbation is
$\epsilon$. An example of the FGSM is shown in Fig. 1. Iterative procedures or randomizations
can significantly strengthen adversarial attacks. An iterative linearization of the DNN is
proposed in the algorithm DeepFool [47] to generate minimal $\ell_p$-norm perturbations for
$p > 1$. The iterative approach continues to add perturbations with bounded $\ell_p$-norm
until the classifier's output is altered. An iterative version of FGSM, called the Basic
Iterative Method (BIM), is proposed in [35]. The Projected Gradient Descent (PGD) attack,
proposed in [43], is an extension of previous techniques in which randomness is additionally
introduced in the computation of adversarial perturbations. The PGD attack can bypass many
defenses and is employed in [43] to devise a defense against adversarial examples. An
iterative algorithm based on PGD combined with randomization is introduced in [5] and has
been used to dismantle many defenses so far [4]. Another popular way of generating
adversarial examples is by constraining the $\ell_0$-norm of the perturbation. Since they
manipulate only a few entries, these types of attacks are known as single pixel attacks [66]
and multiple pixel attacks [52].

In what follows, we provide a unifying framework for generating adversarial examples
that incorporates the above techniques. The main ingredient of this framework
is perturbation analysis. Given a classifier function, the perturbation analysis of this
function quantifies how much its output is perturbed when a known perturbation is
applied to its input. An approximation of this output error is usually obtained using
a first-order Taylor approximation of the function, under the assumption that the
input perturbations have small norm. Adversarial examples fall naturally into this
framework: they are perturbed versions of original inputs, the perturbations are
small, and the function at hand comes naturally from the model. Consider, for example,
the FGSM given in [24]. The proposed attack aims at maximizing the training
loss function, which is approximated by its first-order Taylor approximation. Similarly,
the authors of [24, 47] constructed adversarial examples by maximizing the error,
on a relevant function, that occurs as a consequence of input perturbations. Iterative
methods like the DeepFool method [47], the BIM [35], the PGD method [43], and the
gradient-based norm-constrained method (GNM) [8] maximize the output perturbation
using successive first-order approximations. A summary of the connections
and differences between these methods is provided in [8]. It is based on this framework
that we formulate the problem of generating adversarial examples in this
section.

Let us first fix the terminology used in this section. The input of classifiers is
denoted by $x$. Adversarial examples are then constructed by adding an adversarial
perturbation $\eta$, of the same dimension as $x$, to that input. For multi-class
classification with $K$ classes, a classifier maps inputs to the discrete set of labels $[K]$.
Classifiers modeled by DNNs usually base their decision on a set of functions,
often differentiable, known as score functions. These functions can replace the
non-differentiable classification function, for which a first-order Taylor approximation is
not possible because the gradients are not properly defined. The score functions and
classification functions are defined below.


Definition 1 (Score functions and classifier functions) A classifier is defined by the
mapping $k : \mathbb{R}^M \to [K]$ that maps an input $x \in \mathbb{R}^M$ to its estimated class $k(x) \in [K]$.
The mapping $k(\cdot)$ is itself defined by

$$k(x) = \operatorname*{argmax}_{l \in [K]} \{ f_l(x) \}, \qquad (1)$$

where the functions $f_l : \mathbb{R}^M \to \mathbb{R}$ represent the probability of class belonging. The function
$f(x)$ given by the vector $(f_1(x), \dots, f_K(x))^T$ is known as the score function and can be
assumed to be differentiable almost everywhere for many classifiers.
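
The rule in (1) can be illustrated with a minimal Python sketch. The function name `classify` and the score values are hypothetical, and labels are 0-based here whereas the text uses $[K] = \{1, \dots, K\}$.

```python
def classify(scores):
    """Return the (0-based) index l that maximizes the score f_l(x)."""
    return max(range(len(scores)), key=lambda l: scores[l])


# Hypothetical score vector f(x) for K = 3 classes.
scores = [0.1, 0.7, 0.2]
print(classify(scores))  # 1
```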


Finding adversarial examples amounts to finding a perturbation that changes the
classifier's output. However, since they should be imperceptible, such adversarial
perturbations should not modify the inputs significantly. The undetectability of adversarial
examples can be better understood using image classification tasks as an example.
For instance, in Fig. 1 we observe that the human eye cannot distinguish between
the original and the adversarial image. A common way to impose this restriction is by
constraining the adversarial perturbation to belong to a certain set of unnoticeable
perturbations. For example, the authors of the FGSM bounded the $\ell_\infty$-norm of their
perturbation, whereas in the DeepFool method the norm is incrementally increased until
the classifier output changes. Note that DeepFool may produce perceptible perturbations,
while the FGSM may not fool the classifier.

Another way of imposing undetectability of adversarial examples is to require
the input perturbation to preserve the outcome of the ground truth classifier [74],
also known as the oracle classifier. In many applications, the oracle classifier refers to
the human brain. Similar to Definition 1, denote the score function of the oracle
classifier as $g : \mathbb{R}^M \to \mathbb{R}^K$, which outputs a vector with entries $g_l : \mathbb{R}^M \to \mathbb{R}$ for
$l = 1, \dots, K$. The adversarial perturbation $\eta$ is said to be undetectable if

$$L_g(x, \eta) = g_{k(x)}(x + \eta) - \max_{l \neq k(x)} g_l(x + \eta) > 0. \qquad (2)$$


Using this notion, the problem of finding adversarial examples amounts to the
following.

Definition 2 (Adversarial Generation Problem) For a given $x \in \mathbb{R}^M$, the adversarial
generation problem consists of finding a perturbation $\eta \in \mathbb{R}^M$ to fool the classifier
$k(\cdot)$ by the adversarial sample $\hat{x} = x + \eta$ such that $k(\hat{x}) \neq k(x)$ and the oracle
classifier is not changed, i.e.,

$$\begin{aligned}
\text{Find: } & \eta \\
\text{s.t. } & L_f(x, \eta) = f_{k(x)}(x + \eta) - \max_{l \neq k(x)} f_l(x + \eta) < 0 \\
& L_g(x, \eta) = g_{k(x)}(x + \eta) - \max_{l \neq k(x)} g_l(x + \eta) > 0
\end{aligned} \qquad (3)$$

However, since the oracle classifier is usually unknown, this problem cannot be
solved directly in practice. To overcome this issue, the forthcoming sections show
how the solution of this problem can be approximated by tractable
relaxations.
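
The quantity $L_f$ in (3) is a margin: the score of the originally predicted class minus the largest score among the remaining classes. The sketch below evaluates it on given score vectors; the function name `margin` and the numbers are assumptions for illustration only, and labels are 0-based.

```python
def margin(scores, k):
    """Score of class k minus the largest score among all other classes."""
    others = [s for l, s in enumerate(scores) if l != k]
    return scores[k] - max(others)


clean_scores = [0.1, 0.7, 0.2]      # hypothetical f(x), predicted class k(x) = 1
perturbed_scores = [0.5, 0.3, 0.2]  # hypothetical f(x + eta)

print(margin(clean_scores, 1) > 0)      # positive margin: label unchanged
print(margin(perturbed_scores, 1) < 0)  # negative margin: classifier is fooled
```

A negative margin on the perturbed scores corresponds to the first constraint of (3) being satisfied.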


2.1 White-Box Attacks 


The white-box setting corresponds to the scenario when the classification function
$f(\cdot)$ and the input $x$ are both known to the attacker. Thus, adversarial perturbations are
designed with full knowledge of the target system.


Non-iterative Methods. As discussed above, the constraint on the oracle function
in (3) cannot be computed in practice, since the oracle classifier is not available.
To address this problem, such constraints are approximated by restricting the set of
possible adversarial perturbations to a known subset. The most common choice is to
restrict $\eta$ to belong to the set of vectors with bounded $\ell_p$-norm. The values
of $p$ are restricted to $p \geq 1$ so that the set $\|\eta\|_p \leq \epsilon$ is convex for any $\epsilon > 0$.
Note that the choice of $p$ will determine the structure of the obtained adversarial
examples. The case of $p = \infty$ has been the focus of research in recent years. Even
after replacing the oracle constraint in (3) with a convex one, the problem remains
non-convex. For the case of white-box attacks, a similar relaxation can be carried
out by approximating $L_f(x, \cdot)$ with its first-order Taylor expansion. This is possible
since we assume full knowledge of $x$ and the function $f(\cdot)$.

To that end, the first-order Taylor expansion of $L_f(x, \cdot)$ around $0$ leads to

$$L_f(x, \eta) = L_f(x, 0) + \eta^T \nabla_\eta L_f(x, 0) + O(\|\eta\|_2^2),$$

where $O(\|\eta\|_2^2)$ contains higher order terms. Therefore, by replacing the oracle function
constraint in (3) with $\|\eta\|_p \leq \epsilon$, for sufficiently small $\epsilon \in \mathbb{R}^+$, we get

$$\text{Find: } \eta \quad \text{s.t.} \quad L_f(x, 0) + \eta^T \nabla_\eta L_f(x, 0) \leq 0, \quad \|\eta\|_p \leq \epsilon, \qquad (4)$$

which is a relaxed version of the problem exposed in (3). This formulation of the
problem can be used to construct well known existing adversarial attacks from the
literature. This will be discussed in detail in Sect. 2.1. Nevertheless, the following
theorem shows that this problem is not always feasible.

Theorem 1 The optimization problem (4) is not feasible if for $q = \frac{p}{p-1}$

$$\epsilon \, \|\nabla_\eta L_f(x, 0)\|_q < L_f(x, 0). \qquad (5)$$

The proof can be obtained from the results in [27], as well as in [7].
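
The infeasibility condition of Theorem 1 can be checked numerically once the margin $L_f(x, 0)$ and its gradient are available. The following sketch assumes both are given as plain Python numbers; the helper names and the example values are hypothetical. The dual exponent uses the usual conventions $q = \infty$ for $p = 1$ and $q = 1$ for $p = \infty$.

```python
def lp_norm(v, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


def dual_exponent(p):
    """q = p / (p - 1), with q = inf for p = 1 and q = 1 for p = inf."""
    if p == 1:
        return float("inf")
    if p == float("inf"):
        return 1.0
    return p / (p - 1.0)


def linearized_problem_infeasible(margin0, grad, eps, p):
    """True if eps * ||grad||_q < margin0, i.e. the condition of Theorem 1 holds."""
    return eps * lp_norm(grad, dual_exponent(p)) < margin0


grad = [0.5, -1.0, 0.25]  # hypothetical gradient of L_f(x, .) at 0
print(linearized_problem_infeasible(0.4, grad, 0.1, float("inf")))  # budget too small
print(linearized_problem_infeasible(0.4, grad, 0.3, float("inf")))  # budget large enough
```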

The theorem points to the insight that there might be no perturbation that is
small enough and yet changes the output label. This matches the intuition
that one should not expect to fool a classifier for an arbitrarily small $\epsilon$ in the
perturbation's norm constraint. The result suggests that a feasible problem can be
obtained if we impose only one of the constraints while trying to preserve the other
one as much as possible. To that end, a proper objective function that penalizes the
deviation from the original constraint is minimized. This gives rise to the following
two problems, as feasible counterparts of (4).

First, the norm constraint in (4) is imposed, resulting in the following optimization
problem, called GNM in [7]. It minimizes $L_f(x, 0) + \eta^T \nabla_\eta L_f(x, 0)$ as

$$\min_\eta \; L_f(x, 0) + \eta^T \nabla_\eta L_f(x, 0) \quad \text{s.t.} \quad \|\eta\|_p \leq \epsilon. \qquad (6)$$

Using this approach we can find the best possible perturbation under the norm
constraint. However, a proper value for $\epsilon$ must be chosen beforehand to guarantee
that the perturbations remain unnoticed. Moreover, this problem has a closed form
solution which can be computed efficiently, as stated in the following theorem.


Theorem 2 If $\nabla_\eta L_f(x, \eta) = \bigl(\frac{\partial L_f(x, \eta)}{\partial \eta_1}, \dots, \frac{\partial L_f(x, \eta)}{\partial \eta_M}\bigr)$, the closed form solution to
the minimizer of the problem (6) is given by

$$\eta = -\frac{\epsilon}{\|\nabla_\eta L_f(x, 0)\|_q^{q-1}} \, \mathrm{sign}(\nabla_\eta L_f(x, 0)) \odot |\nabla_\eta L_f(x, 0)|^{q-1} \qquad (7)$$

for $q = \frac{p}{p-1}$, where $\mathrm{sign}(\cdot)$ and $|\cdot|^{q-1}$ are applied element-wise, and $\odot$ denotes the
element-wise (Hadamard) product. Particularly for $p = \infty$, we have $q = 1$ and the
solution is given by the following

$$\eta = -\epsilon \, \mathrm{sign}(\nabla_\eta L_f(x, 0)). \qquad (8)$$

The proof can be found in [7]. One advantage of using (6), besides having a closed-form
solution, is that additional constraints on the perturbation can be added to the
problem. In addition, the solution shown in (7) can be reused for other choices of
$L_f(x, \cdot)$, which can be more suitable depending on the scenario. For instance, the
FGSM chooses $L_f(x, \cdot)$ to be the negative of the loss function used for training,
which is often the cross-entropy loss in classification problems. Then, minimizing
$L_f(x, \cdot)$ corresponds to maximizing the loss. A caveat is that using problem (6)
ensures perturbations with bounded norms, but such perturbations may not be able
to fool the classifier.
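
The closed form of Theorem 2 can be sketched in a few lines of plain Python. The gradient is assumed to be given as a non-zero list of floats (how it is obtained from a model is outside the scope of this sketch), and the function names are hypothetical. The case $p = \infty$ is handled by setting $q = 1$, which recovers the sign rule of (8).

```python
def lp_norm(v, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


def sign(a):
    """Sign of a number as -1, 0 or 1."""
    return (a > 0) - (a < 0)


def norm_constrained_perturbation(grad, eps, p):
    """Minimizer of eta^T grad subject to ||eta||_p <= eps, for p > 1.

    Assumes grad is not the zero vector.
    """
    q = 1.0 if p == float("inf") else p / (p - 1.0)
    scale = eps / lp_norm(grad, q) ** (q - 1.0)
    return [-scale * sign(g) * abs(g) ** (q - 1.0) for g in grad]


grad = [3.0, -4.0]  # hypothetical gradient of L_f(x, .) at 0
print(norm_constrained_perturbation(grad, 0.5, 2))             # l_2 budget
print(norm_constrained_perturbation(grad, 0.5, float("inf")))  # sign rule of (8)
```

In both cases the returned perturbation uses the full budget, that is, its $\ell_p$-norm equals $\epsilon$.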

A second approach for relaxing (4) into a feasible problem is to keep the constraint
regarding $L_f(x, \cdot)$ and minimize over the norm of $\eta$. Therefore, the problem of (4)
is replaced by

$$\min_\eta \; \|\eta\|_p \quad \text{s.t.} \quad L_f(x, 0) + \eta^T \nabla_\eta L_f(x, 0) \leq 0. \qquad (9)$$

This approach is used by [47] on every iteration of the DeepFool algorithm (more
details in Sect. 2.1). Similarly to (6), this problem has a closed form solution as well,
which is given in the following theorem.

Theorem 3 If $\nabla_\eta L_f(x, \eta) = \bigl(\frac{\partial L_f(x, \eta)}{\partial \eta_1}, \dots, \frac{\partial L_f(x, \eta)}{\partial \eta_M}\bigr)$, the closed form solution to
the problem (9) is given by

$$\eta = -\frac{L_f(x, 0)}{\|\nabla_\eta L_f(x, 0)\|_q^{q}} \, \mathrm{sign}(\nabla_\eta L_f(x, 0)) \odot |\nabla_\eta L_f(x, 0)|^{q-1} \qquad (10)$$

for $q = \frac{p}{p-1}$.

Observe that the perturbation from Theorem 3, similar to the solution in
Theorem 2, is nothing but an adjusted version of the gradient of the classifier with
a different norm. The perturbation in (10) might grow unbounded to ensure that the
classifier is misled, which makes it perceptible by the oracle. There are other similar
methods for computing adversarial examples that depend on a first-order approximation
of other performance-related functions. These algorithms are later shown to
be slight variations of the methods presented in this section. Furthermore, using the
present formulation, we can build iterative procedures by repeating the optimization
problem until the classifier output changes. In Sect. 2.1, we compare different
methods, which are formulated as iterative versions of (6) and (9).
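
The minimal-norm solution of Theorem 3 can be sketched analogously. The margin $L_f(x, 0)$ and a non-zero gradient are assumed to be given; the function names and numbers are hypothetical. By Hölder's inequality, the smallest $\ell_p$-norm that brings the linearized margin to zero is $L_f(x, 0) / \|\nabla_\eta L_f(x, 0)\|_q$, which the usage example checks numerically.

```python
def lp_norm(v, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


def sign(a):
    """Sign of a number as -1, 0 or 1."""
    return (a > 0) - (a < 0)


def minimal_norm_perturbation(margin0, grad, p):
    """Smallest l_p perturbation that zeroes the linearized margin, for p > 1.

    Assumes grad is not the zero vector.
    """
    q = 1.0 if p == float("inf") else p / (p - 1.0)
    c = margin0 / lp_norm(grad, q) ** q
    return [-c * sign(g) * abs(g) ** (q - 1.0) for g in grad]


margin0, grad = 2.5, [3.0, 4.0]  # hypothetical L_f(x, 0) and its gradient
eta = minimal_norm_perturbation(margin0, grad, 2)
linearized = margin0 + sum(e * g for e, g in zip(eta, grad))
print(eta, linearized, lp_norm(eta, 2))
```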

There are some methods that rely on adding randomness in the generation process.
The PGD attack, from [43], is one well known example. For the PGD attack, the
first-order approximation is taken not around $\eta = 0$ but around a random point
$\tilde{\eta}$ whose norm is bounded by some $\tilde{\epsilon}$, that is $\|\tilde{\eta}\|_p \leq \tilde{\epsilon}$. In short, the objective
function $L_f(x, \cdot)$ is approximated by its linear counterpart around the point $\tilde{\eta}$, which
lies within an $\tilde{\epsilon}$-radius from $\eta = 0$. The distribution of $\tilde{\eta}$ can be arbitrarily chosen as
long as the norm constraint is not violated. A common choice is to use the uniform
distribution over the set of vectors with bounded $\ell_p$-norm. We denote this technique
as dithering. Moreover, displacing the center of the first-order approximation from
$0$ to $\tilde{\eta}$ does not lead to solutions which differ from the ones given so far. This is true
since $L_f(x, \eta) \approx L_f(x, \tilde{\eta}) + (\eta - \tilde{\eta})^T \nabla_\eta L_f(x, \tilde{\eta})$ leads to the following problem

$$\min_\eta \; L_f(x, \tilde{\eta}) + (\eta - \tilde{\eta})^T \nabla_\eta L_f(x, \tilde{\eta}) \quad \text{s.t.} \quad \|\eta\|_p \leq \epsilon,$$

which corresponds to solving

$$\min_\eta \; \eta^T \nabla_\eta L_f(x, \tilde{\eta}) \quad \text{s.t.} \quad \|\eta\|_p \leq \epsilon. \qquad (11)$$

When training models with adversarial examples, it is advantageous to add randomness
to their computation in order to increase the diversity of the adversarial
perturbations during training [72], as is done with the dithering technique in the
quantization literature. Further details about training with adversarial examples are discussed
in Sect. 4.1.


Single Subset Attacks. In the field of image recognition, it is also common to model 
undetectability by restricting the number of pixels that can be altered by an attacker. 
Single and multiple pixel attacks are introduced to this end. For the case of gray- 
scale images, altering only one value of the input vector is equivalent to a single pixel 
attack. This is, however, not a general rule. If inputs are RGB images, each pixel will 
be defined as a subset of three values. 

Since adversarial attacks go beyond image-based systems, we refer to attacks that
target only a subset of entries as single subset attacks. Given that perturbations
belong to $\mathbb{R}^M$, let us partition $[M] = \{1, \dots, M\}$ into $S$ possible subsets $\mathcal{S}_1, \dots, \mathcal{S}_S$.

These sets may be of different size, but for the sake of clarity let us assume that
they have the same cardinality $Z = M/S$, where $\mathcal{S}_s = \{i_s^1, \dots, i_s^Z\} \subseteq [M]$. Define
the mixed zero-$\mathcal{S}$ norm $\|\cdot\|_{0,\mathcal{S}}$ of a vector, for the partition $\mathcal{S} = \{\mathcal{S}_1, \dots, \mathcal{S}_S\}$, as the
number of subsets including at least one index related to a non-zero entry of $x$,¹ that
is

$$\|x\|_{0,\mathcal{S}} = \sum_{i=1}^{S} 1\bigl(\|x_{\mathcal{S}_i}\| \neq 0\bigr),$$

where $1(\cdot)$ denotes the indicator function. Hence, the norm $\|\eta\|_{0,\mathcal{S}}$ counts the number
of subsets altered by an attacker. Moreover, we can guarantee that only one subset
stays active by including this as an additional constraint in (3), yielding

$$\min_\eta \; L_f(x, \eta) \quad \text{s.t.} \quad \|\eta\|_\infty \leq \epsilon, \quad \|\eta\|_{0,\mathcal{S}} = 1. \qquad (12)$$
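As a remark, the mixed norm $\|\cdot\|_{0,\mathcal{S}}$ is extensively used in signal processing and
compressed sensing to promote group sparsity [57]. In a similar manner as in Sect. 2.1,
we employ the approximation $L_f(x, \eta) \approx L_f(x, \tilde{\eta}) + (\eta - \tilde{\eta})^T \nabla_\eta L_f(x, \tilde{\eta})$, which
yields the following linear programming formulation of (12) as

$$\min_\eta \; \eta^T \nabla_\eta L_f(x, \tilde{\eta}) \quad \text{s.t.} \quad \|\eta\|_\infty \leq \epsilon, \quad \|\eta\|_{0,\mathcal{S}} = 1. \qquad (13)$$

¹ Similar to the so-called $\ell_0$-norm, this is not a proper norm.

Given a subset $\mathcal{S}_s$, we define $\eta_s$ as

$$\eta_s = \operatorname*{argmin}_\eta \; \eta^T \nabla_\eta L_f(x, \tilde{\eta}) \quad \text{s.t.} \quad \|\eta\|_\infty \leq \epsilon, \quad (\eta)_i = 0 \;\; \forall i \notin \mathcal{S}_s.$$

Note that we have a closed form solution for $\eta_s$, that is

$$\eta_s = -\epsilon \sum_{z=1}^{Z} \mathrm{sign}\bigl((\nabla_\eta L_f(x, \tilde{\eta}))_{i_s^z}\bigr) \, e_{i_s^z},$$

where $e_i$ denotes the $i$-th standard basis vector, which implies that
$\eta_s^T \nabla_\eta L_f(x, \tilde{\eta}) = -\epsilon \sum_{z=1}^{Z} |(\nabla_\eta L_f(x, \tilde{\eta}))_{i_s^z}|$. Then, this problem
has the closed form solution given by

$$\eta^* = \eta_{s^*}, \quad \text{with } s^* = \operatorname*{argmax}_s \sum_{z=1}^{Z} \bigl|(\nabla_\eta L_f(x, \tilde{\eta}))_{i_s^z}\bigr|. \qquad (14)$$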
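
The selection rule in (14) picks the subset whose gradient entries have the largest summed magnitude and applies the sign rule only there. A minimal sketch, assuming the gradient is given as a list and the partition as a list of index lists (both hypothetical, e.g. three RGB pixels with three values each):

```python
def sign(a):
    """Sign of a number as -1, 0 or 1."""
    return (a > 0) - (a < 0)


def single_subset_attack(grad, subsets, eps):
    """Return (eta, best_subset) for the linearized single subset problem."""
    # Subset with the largest sum of absolute gradient entries.
    best = max(subsets, key=lambda idx: sum(abs(grad[i]) for i in idx))
    eta = [0.0] * len(grad)
    for i in best:
        eta[i] = -eps * sign(grad[i])  # sign rule restricted to the chosen subset
    return eta, best


# Hypothetical gradient for three "pixels" with three channels each.
grad = [0.1, -0.2, 0.05, 0.4, 0.3, -0.6, 0.0, 0.1, 0.1]
subsets = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
eta, best = single_subset_attack(grad, subsets, 0.1)
print(best, eta)
```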


Iterative Methods and Randomization. In the previous section, we summarized different
versions of the problem of generating adversarial perturbations. An overview
of these methods is shown in Table 1. In this section, we work with these solutions
to design adversarial perturbations using iterative approximations. Since the
principle behind the approaches in (6) and (9) is the same, we will only focus on (6).
Nevertheless, it is straightforward to extend the algorithms presented in this section to use (9)
instead.

In Algorithm 1, an iterative method based on (6) is introduced. This iterative
version of (6) resembles a gradient descent method for minimizing $L_f(x, \eta)$ over
$\eta$ with a fixed number of iterations and steps of equal $\ell_p$-norm. In addition, a
set of parameters $\tilde{\epsilon}_1, \dots, \tilde{\epsilon}_T$ is required to control the norm of the random noise used
for dithering. There is no dithering if $\tilde{\epsilon}_t$ is set to zero for all $t = 1, \dots, T$. The
well known PGD attack uses dithering by applying it at the initial iteration. In this
attack $\tilde{\epsilon}_2 = \dots = \tilde{\epsilon}_T = 0$, and $\mathrm{random}(\tilde{\epsilon})$ generates a random vector with a uniform
distribution over the $\ell_\infty$-ball of radius $\tilde{\epsilon}$ (centered at 0).


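Algorithm 1 Iterative extension for $\ell_p$ constrained methods.

input: $x$, $f(\cdot)$, $T$, $\epsilon$, $\tilde{\epsilon}_1, \dots, \tilde{\epsilon}_T$.
output: $\eta^*$.
Initialize $\eta_1 \leftarrow 0$.
for $t = 1, \dots, T$ do
    $\tilde{\eta}_t \leftarrow \eta_t + \mathrm{random}(\tilde{\epsilon}_t)$
    $\eta_t^* \leftarrow \operatorname{argmin}_\eta \; \eta^T \nabla_\eta L_f(x, \tilde{\eta}_t)$ s.t. $\|\eta\|_p \leq \epsilon / T$ (Table 1)
    $\eta_{t+1} \leftarrow \tilde{\eta}_t + \eta_t^*$
end for
return: $\eta^* \leftarrow \eta_{T+1}$

Table 1 Summary of the obtained closed-form solutions

Type of attack          Relaxed problem    Closed-form solution
[row illegible in the source]
Single-subset attack    (13)               (14)

Table 2 Recovering existing attacks in classification using this framework [8]
[only fragments of this table survive in the source]

FGSM [24]
DeepFool [47]
BIM [35]
PGD [43]                Cross-entropy
[row illegible in the source]
Regression [8]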

Finally, computing $\eta_t^*$ with the additional constraint $\|\eta\|_{0,\mathcal{S}} = 1$ in Algorithm 1
leads to a multiple subset attack. For such an attack one must additionally remove
previously modified subsets from $\mathcal{S}$. This results in a new subset being altered at every
iteration. Similarly, given a class label $l \in [K]$, changing the objective function to

$$L_f(x, \eta) = f_{k(x)}(x + \eta) - f_l(x + \eta)$$

leads to a targeted attack, that is, an attack whose objective is to apply perturbations such
that the outcome of classification is always some "target" class $l$.

As we can see, different configurations of Algorithm 1 lead to known adversarial
attacks from the literature. A summary of Algorithm 1 configurations with their
corresponding attacks from the literature is presented in Table 2. In classification,
these methods are usually compared using the fooling ratio, that is, the percentage
of correctly classified inputs that are misclassified when adversarial perturbations
are added. Visualizing the fooling ratio for different values of $\epsilon$ is often used to
empirically assess the performance of an adversarial attack. For example, in Fig. 2 we
observe the fooling ratio of different attacks on standard DNNs (not trained to resist
adversarial attacks).
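
The fooling ratio as described above can be computed directly from labels and predictions. A minimal sketch with hypothetical data; the function name is an assumption:

```python
def fooling_ratio(true_labels, clean_preds, adv_preds):
    """Percentage of correctly classified inputs that become misclassified."""
    correct = [i for i, (y, c) in enumerate(zip(true_labels, clean_preds)) if y == c]
    if not correct:
        return 0.0
    fooled = sum(1 for i in correct if adv_preds[i] != true_labels[i])
    return 100.0 * fooled / len(correct)


true_labels = [0, 1, 2, 1, 0]  # hypothetical ground truth
clean_preds = [0, 1, 2, 0, 0]  # predictions on clean inputs (4 correct)
adv_preds = [1, 1, 0, 0, 0]    # predictions on perturbed inputs
print(fooling_ratio(true_labels, clean_preds, adv_preds))  # 50.0
```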


Regression Problems and Other Learning Tasks. The objective

$$L_f(x, \eta) = -\|f(x) - f(x + \eta)\|_2^2$$

can be used to attack regression problems as well. This possibility is investigated in
[8], where the objective is to perturb the output of a regression model as much as
possible. Two examples are provided using autoencoders and colorization² DNNs
in Fig. 3. In that figure, we observe how adversarial perturbations heavily distort
the outcome of regression. Using the principles explicated in this section, other
algorithms have been developed for attacking other types of learning systems. In
the field of computer vision, [28] constructed an attack on image segmentation,
while [76] designed attacks for object detection. The Houdini attack [12] aims at
distorting speech recognition systems. In addition, [53] tailored an attack for recurrent
neural networks, and [40] for reinforcement learning. Adversarial examples exist for
probabilistic methods as well. For instance, [33] showed the existence of adversarial
examples for generative models. For regression problems, [70] designed an attack
that specifically targets variational autoencoders.

Fig. 2 Fooling ratio, from [7], of different adversarial attacks on vanilla DNNs on the MNIST
dataset. (a) 5-layered LeNet architecture (LeNet-5) from [37], (b) DenseNet architecture from [30] with 40 layers (DenseNet-40)

Fig. 3 Adversarial examples for regression [8]. (a) MNIST autoencoder, (b) STL-10 colorization
network


² A colorization model predicts the color values for every pixel in a given gray-scale image.

Robustness metrics. Going back to the definitions of Sect. 2.1, Theorem 1 shows
that given a vector $x$ and a score function $f(\cdot)$, the adversarial perturbation should
have an $\ell_p$-norm of at least $\frac{L_f(x, 0)}{\|\nabla_\eta L_f(x, 0)\|_q}$ to fool the linearized version of $f(\cdot)$. In
other words, if the ratio $\frac{L_f(x, 0)}{\|\nabla_\eta L_f(x, 0)\|_q}$ is small, then it is easier to fool the network
with $\ell_p$-attacks. In that sense, Theorem 1 provides an insight into the stability of
classifiers. Therefore, regularizing the loss function with $\frac{L_f(x, 0)}{\|\nabla_\eta L_f(x, 0)\|_q}$ may lead to
adversarial robustness. Moreover, one can also include dithering by regularizing
with $\frac{L_f(x, \tilde{\eta})}{\|\nabla_\eta L_f(x, \tilde{\eta})\|_q}$ for some randomly chosen $\tilde{\eta}$.

Table 3 Experiment from [7] showing the robustness measures for different DNNs on the MNIST
and CIFAR-10 datasets. The acronym FCNN denotes a standard fully connected neural network,
while NIN refers to the network-in-network architecture from [39]. LeNet-5 and DenseNet are the
same architectures used in Fig. 2

DenseNet (CIFAR-10)    $\epsilon = 0.010$
[remaining table entries illegible in the source]

In [47], the authors suggest that the robustness of the classifiers can be measured
as

$$\hat{\rho}_1(f) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \frac{\|\hat{r}(x)\|_p}{\|x\|_p},$$

where $\mathcal{D}$ denotes the test set and $\hat{r}(x)$ is the minimum perturbation required to change
the classifier's output. Theorem 1 suggests that one can also use the following as
the measure of robustness

$$\hat{\rho}_2(f) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \frac{L_f(x, 0)}{\|\nabla_\eta L_f(x, 0)\|_q}.$$

The lower $\hat{\rho}_2(f)$, the easier it gets to fool the classifier and therefore the less
robust it is to adversarial examples. According to the experiments in [7], shown in Table 3,
these two robustness metrics seem to be coherent when measuring the robustness of
non-adversarially trained DNNs.
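
The first metric, attributed to [47], averages over the test set the ratio between the norm of the minimal perturbation and the norm of the input. The sketch below assumes the minimal perturbations have already been computed by some attack; all names and numbers are hypothetical.

```python
def lp_norm(v, p):
    """l_p norm of a list of floats; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


def average_relative_perturbation(inputs, perturbations, p=2):
    """Mean of ||r(x)||_p / ||x||_p over the given (non-zero) inputs."""
    ratios = [lp_norm(r, p) / lp_norm(x, p) for x, r in zip(inputs, perturbations)]
    return sum(ratios) / len(ratios)


inputs = [[3.0, 4.0], [0.0, 2.0]]         # hypothetical test inputs
perturbations = [[0.3, 0.4], [0.0, 0.1]]  # hypothetical minimal perturbations
print(average_relative_perturbation(inputs, perturbations))
```

A larger value indicates that, on average, larger relative perturbations are needed to change the classifier's output.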


2.2 Black-Box Attacks and Universal Adversarial 
Perturbations 


So far we have assumed that the adversarial attacker has perfect knowledge of the
target classifier function $f(\cdot)$ as well as the input $x$. By loosening these requirements
into more realistic assumptions, new types of algorithms arise, namely

— black-box attacks: these methods correspond to the settings where the target
classifier $f(\cdot)$ is unknown but the input $x$ may still be known to the attacker,

— universal adversarial perturbations: these perturbations are designed to work
regardless of the input $x$, which is assumed to be unknown. Nevertheless, the
classifier $f(\cdot)$ may be available to the attacker.


If both the target model $f(\cdot)$ and the input $x$ are unknown to the attacker, the
adversarial attack would be a black-box as well as a universal adversarial perturbation.
These types of attacks are still possible by assuming partial or indirect knowledge
about the input $x$ and the classifier $f(\cdot)$. For example, the attacker may have
access to a set of independent realizations $\{x_1, x_2, \dots\}$ of the input, which provides
knowledge about the input distribution $P_x$. Similarly, implicit information about
the classifier can be inferred by observing the independent realizations of the pairs
$\{(x_1, f(x_1)), (x_2, f(x_2)), \dots\}$. Finally, we may also have knowledge about the structure
(number of layers, types of connections, activation functions, etc.) of the used
classifier.

It is probably unexpected that some adversarial perturbations produce the same
effect over different inputs and different DNN architectures, although they are
generated for a particular model. These universal adversarial perturbations are reported
in [24, 57], where the authors show the existence of such perturbations for various
datasets and DNNs. This phenomenon suggests that there exist certain common
properties shared by adversarial perturbations that account for most of the success
when attacking a system. This can explain why certain perturbations are able to
simultaneously fool a target DNN on different inputs. Adversarial examples are indeed
transferable. In [72] the authors construct an attack such that adversarial examples can
transfer from one random instance of a neural network to another. Surprisingly, these
methods proved to be effective against well known DNNs. Since no explicit
knowledge about the DNN weights is required to compute these perturbations, they
can be thought of as black-box attacks. Moreover, the authors showed that including
such black-box adversarial examples in the training set significantly enhances the
robustness of neural networks. Finally, the authors in [45] showed that there exist
adversarial examples that are both universal and black-box, that is, perturbations that
are independent of the target DNN and the input.


Black-Box Attacks. As discussed, in the black-box setting the classifier function $f(\cdot)$
is unknown, thus we cannot compute the gradient necessary for Algorithm 1. A common
approach to circumvent this issue is to estimate the gradient by choosing a substitute
model $\hat{f}$ which is hoped to behave similarly to the unknown $f(\cdot)$. This
concept is introduced in [51] under the assumption that the input $x$ is known, as well
as $n$ independent realizations of $(x, f(x))$ denoted as $(x_1, f(x_1)), \dots, (x_n, f(x_n))$.
This method consists of the following two steps:

1. Train a substitute model $\hat{f}$ that predicts $f(x)$, so that it resembles the target classifier.
2. Perform a white-box attack on the substitute model $\hat{f}$ and hope it transfers to the
target model $f(\cdot)$.


This concept is later extended in [41], where the authors make use of several substitute
models, that is $\hat{f}_1, \dots, \hat{f}_r$ for $r \geq 1$. In that work, adversarial perturbations are
computed by approximately solving the following optimization for an ensemble of
loss functions:

$$\min_\eta \; -\log\Bigl(\sum_{i=1}^{r} \alpha_i L_{\hat{f}_i}(x, \eta)\Bigr) + \lambda \|\eta\|_p$$

where $\lambda > 0$, $p \geq 1$, $0 \leq \alpha_i \leq 1$, $\sum_i \alpha_i = 1$ and $L_{\hat{f}_i}(x, \cdot)$ is some positive loss
function like the cross-entropy loss of $\hat{f}_i(\cdot)$ at the point $(x + \eta)$. The key idea of this
method is that a perturbation $\eta$ that is able to fool the classifiers $\hat{f}_1, \dots, \hat{f}_r$ will most
likely fool the unknown classifier $\hat{f}_{r+1} = f$ as well. Note that it is also possible to
generate norm-constrained versions of this method by approximately solving

$$\min_\eta \; -\log\Bigl(\sum_{i=1}^{r} \alpha_i L_{\hat{f}_i}(x, \eta)\Bigr) \quad \text{s.t.} \quad \|\eta\|_p \leq \epsilon \qquad (15)$$

using the same methods described in Table 1.


Universal Adversarial Perturbations. Given $0 < \delta < 1$, the paradigm of designing
universal adversarial perturbations $u$ can be summarized as follows

$$\begin{aligned}
\text{Find: } & u \\
\text{s.t. } & \|u\|_p \leq \epsilon \\
& P_x\bigl(k(x + u) \neq k(x)\bigr) \geq 1 - \delta.
\end{aligned}$$

Note that in order to approximately solve this problem one needs information about
the distribution of the input. A common assumption when designing universal perturbations
is that the attacker has perfect knowledge of the classifier $k(\cdot)$, but only partial
knowledge about $P_x$ in the form of $n$ independent realizations $\mathcal{X}_n = \{x_1, \dots, x_n\}$ of
the input $x$.

This problem is first approached in [45] by iteratively aggregating the perturbations
that move $x_1, \dots, x_n$ to their corresponding decision boundaries. Given an input
$x_i$, such perturbations are computed by iteratively solving (9) in the same manner as
in Algorithm 1. Then, in order to preserve the $\ell_p$-norm constraint, these perturbations
are projected onto the $\ell_p$ ball of radius $\epsilon$. A summary of this method is shown
in Algorithm 2. In addition, some example pictures showing the effectiveness of
this algorithm are shown in Fig. 4. Note that this algorithm does not converge for an
arbitrary choice of $\delta$, thus additional stopping criteria are needed.




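Algorithm 2 Universal adversarial perturbations with constrained $\ell_p$-norm.
input: $x_1, \dots, x_n$, $k(\cdot)$, $\epsilon$, $\delta$.
output: $u^*$.
Initialize $u^* \leftarrow 0$.
while $P_{x \in \mathcal{X}_n}\bigl(k(x + u^*) \neq k(x)\bigr) < 1 - \delta$ do
    Shuffle $\mathcal{X}_n$
    for $i = 1, \dots, n$ do
        if $k(x_i + u^*) = k(x_i)$ then
            Compute the minimal perturbation that sends $x_i + u^*$ to the decision boundary:
            $\eta^* \leftarrow \operatorname{argmin}_\eta \|\eta\|_2$ s.t. $k(x_i + u^* + \eta) \neq k(x_i)$
            Project the perturbation $u^* + \eta^*$ onto the $\ell_p$ ball of radius $\epsilon$:
            $u^* \leftarrow \operatorname{argmin}_u \|u^* + \eta^* - u\|_2$ s.t. $\|u\|_p \leq \epsilon$
        end if
    end for
end while
return: $u^*$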
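
The projection step of Algorithm 2 has a simple closed form for the two most common cases, $p = 2$ (rescaling) and $p = \infty$ (coordinate-wise clipping). The sketch below covers only these two cases; the function names are assumptions for illustration.

```python
import math


def project_l2(u, eps):
    """Euclidean projection of u onto the l_2 ball of radius eps."""
    norm = math.sqrt(sum(a * a for a in u))
    if norm <= eps:
        return list(u)
    return [a * eps / norm for a in u]


def project_linf(u, eps):
    """Euclidean projection of u onto the l_inf ball of radius eps (clipping)."""
    return [max(-eps, min(eps, a)) for a in u]


print(project_l2([3.0, 4.0], 1.0))
print(project_linf([0.3, -2.0, 0.05], 0.1))
```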


In a similar fashion as in (15), a universal adversarial perturbation can be obtained
by minimizing an ensemble of objective functions, that is

$$\min_\eta \; \sum_{i=1}^{n} L_f(x_i, \eta) \quad \text{s.t.} \quad \|\eta\|_p \leq \epsilon,$$

where $L_f(x, \cdot)$ is some objective function as in (2). Choosing $L_f$ to be $L_f(x, \eta) =
-\|f(x) - f(x + \eta)\|_p$ and using the approximation $\|f(x) - f(x + \eta)\|_p \approx
\|J_f(x)\eta\|_p$, where $J_f(x)$ denotes the Jacobian matrix of $f(\cdot)$ at $x$, the method
proposed in [32] is obtained. This method is based on the insight that a perturbation that
manages to fool several known inputs will most likely fool an unknown one as well.


3 Theoretical Explanations of the Nature of Adversarial 
Examples 


Among various theories regarding the nature of adversarial examples, two directions 
can be singled out. One line of research focuses on local properties of classifiers, for 
example, decision boundaries of classifiers and their geometric properties. A notable 
example is the linearity hypothesis, proposed by the authors in [24], where the exis- 
tence of adversarial images is attributed to the approximate linearity of classifiers. 
Another line of research tends to explain such phenomena by means of global
properties of classifiers such as the topological dimension of their feature spaces or the
sparsity of weight matrices in DNNs. We present some of the most important results
along these two lines.

There is, however, another theoretical question raised in the literature. After some
experimental results suggested a seemingly opposing relation between adversarial
robustness and generalization, researchers formally examined the connection between
generalization properties of DNNs and their adversarial robustness. The central
question is whether adversarial robustness is realized only at the cost of worse
generalization. This question is the subject of the last part of this section.


Fig. 4 The authors in [45] add a universal perturbation (center image) that is able to mislead the
classification of several images


3.1 Linearity Hypothesis and Curvature of Decision
Boundaries


How can the existence of adversarial examples be explained? The experimental 
results showed that adversarial examples are also misclassified by neural networks 
trained either on a different dataset or with different hyper-parameters. Therefore, 
this phenomenon cannot be attributed to overfitting to a particular model. It was first
conjectured in [24] that adversarial examples exist because neural networks
are well approximated, locally, by linear classifiers. This hypothesis, known as the
linearity hypothesis, is supported by the ease with which adversarial examples are
generated using first-order approximations of neural networks.

For deep neural networks, this claim is mainly experimentally substantiated. The 
attacks applied to the linear approximation of neural networks around an instance 
manage to effectively generate misclassified examples, and therefore, the effective- 
ness of first-order approximation attacks testifies, according to [24], to the linearity 
of these models. 

To elucidate this claim, consider a linear classifier for binary classification tasks
with the parameter $w \in \mathbb{R}^M$. The classification rule is given simply by $\mathrm{sign}(w^T x)$.
In this linear setting, the FGSM provides the best $\ell_\infty$-bounded adversarial perturbation,
which is given by $\eta = -\varepsilon\,\mathrm{sign}(w)$. The total perturbation caused at the output
is $\eta^T w$, which is equal to $-\varepsilon\|w\|_1$. The value of $\|w\|_1$ can be as large as $\sqrt{M}$ for
$w$'s with unit $\ell_2$-norm. Therefore, a small $\ell_\infty$-perturbation can be blown up by $\sqrt{M}$.
Two conclusions can initially be drawn from this example. First, small input perturbations
can incur large output perturbations for high dimensional linear classifiers.
The authors in [24] argue accordingly for the existence of adversarial examples:
high dimensional, approximately linear classifiers can blow up small perturbations
at their output. It is, according to [24], the approximate linearity of Convolutional
Neural Networks (CNNs) that explains the existence of adversarial examples for
the ImageNet classification problem. Although CNNs have many non-linearities,
the parameters of CNNs after training are chosen so that the non-linearity of the
architecture is diminished.
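The blow-up in the linear example can be checked numerically. The following is a minimal sketch, assuming a dense weight vector $w = (1/\sqrt{M}, \dots, 1/\sqrt{M})$ and a sparse one $w = e_1$, both of unit $\ell_2$-norm; the function names and the values of $M$ and $\varepsilon$ are illustrative choices and are not taken from [24].

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_linear(w, eps):
    """FGSM perturbation eta = -eps * sign(w) for the linear score w^T x."""
    return [-eps * sign(wi) for wi in w]


def output_shift(w, eta):
    """Change of the score w^T x caused by adding eta to x."""
    return sum(wi * ei for wi, ei in zip(w, eta))


M, eps = 10_000, 0.01  # illustrative dimension and l_inf budget

# Dense weights with unit l2-norm: ||w||_1 = sqrt(M).
w_dense = [1.0 / math.sqrt(M)] * M
# Sparse weights with unit l2-norm: ||w||_1 = 1.
w_sparse = [1.0] + [0.0] * (M - 1)

shift_dense = output_shift(w_dense, fgsm_linear(w_dense, eps))
shift_sparse = output_shift(w_sparse, fgsm_linear(w_sparse, eps))

print(f"dense  w: shift = {shift_dense:.4f}, -eps*sqrt(M) = {-eps * math.sqrt(M):.4f}")
print(f"sparse w: shift = {shift_sparse:.4f}, -eps = {-eps:.4f}")
```

With these values the output of the dense classifier moves by about $\varepsilon\sqrt{M} = 1$, a hundred times the per-coordinate budget, whereas the output of the sparse classifier moves only by $\varepsilon$.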

The second conclusion points to the $\ell_1$-norm of the parameter $w$ as the key to
controlling the effectiveness of $\ell_\infty$-attacks, which can be used to design robust neural
networks as we will see later. In general, the $\ell_q$-norm of $w$ controls the output
perturbation for $\ell_p$-attacks where $(p, q)$ are dual to each other.³ However, this example only
shows that for some linear classifiers, small input perturbations change the output
significantly.
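The role of the dual norm follows from Hölder's inequality. As a short worked step (a standard argument, not specific to [24]), for any perturbation with $\|\eta\|_p \le \varepsilon$ and $1/p + 1/q = 1$:

```latex
\[
  |\eta^T w| \;\le\; \|\eta\|_p \, \|w\|_q \;\le\; \varepsilon \, \|w\|_q ,
\]
\[
  \text{with equality for } p = \infty,\ q = 1 \text{ at } \eta = -\varepsilon\,\mathrm{sign}(w):
  \qquad
  \eta^T w = -\varepsilon \sum_{i=1}^{M} |w_i| = -\varepsilon \|w\|_1 .
\]
```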

The linearity hypothesis did not remain unchallenged. After all, non-linear
machine learning algorithms were equally vulnerable to adversarial examples. In
[59], adversarial examples were generated by imposing a similarity constraint
between the hidden representation of the perturbed image in a DNN and the hidden
representation of an image from a different class. An optimization problem is
used to minimize the difference between hidden representations with constraints on
the perturbation. The method yielded adversarial images which differed from other
adversarial images in that they did not rely, even implicitly, on linear approximations
of the model and could not be generated from them. The linearity
hypothesis cannot explain the existence of these adversarial images.


³ We call the pair (p, q) dual if the corresponding norms are dual. In particular, 1/p + 1/q = 1.

For the ImageNet classification problem with CNNs, the authors in [42] examined
the linearity hypothesis by comparing $f(\eta)$ and $f(x + \eta) - f(x)$. These two values
are equal for linear classifiers, so their difference is used to measure the linearity of CNNs.
The conclusion of this study goes against the linearity hypothesis. The experimental 
studies of these values do not indicate any linear structure for CNNs. According to 
the modified linearity hypothesis, proposed in [42], CNNs are locally linear around 
the objects recognized by the model. The locality assumption is crucial here. Deep 
neural networks are non-linear in general and cannot be replaced completely by a 
linear classifier. However, the local linearity hypothesis claims that these models in a 
neighborhood of an instance can be approximated by a linear classifier. Additionally, 
the CNNs can be non-linear around those instances that are not recognized by the 
model. 
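The comparison of $f(\eta)$ with $f(x + \eta) - f(x)$ can be sketched in a few lines. The example below is only an illustration under our own assumptions: a random two-layer ReLU network stands in for a CNN, the layer sizes are hypothetical, and the setup is not the experimental protocol of [42].

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def linear_model(W1, W2, x):
    """Purely linear map x -> W2 W1 x (no bias, no activation)."""
    return matvec(W2, matvec(W1, x))


def relu_model(W1, W2, x):
    """Two-layer network x -> W2 relu(W1 x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def linearity_gap(f, x, eta):
    """Euclidean norm of f(x + eta) - f(x) - f(eta); zero for a linear f."""
    x_pert = [xi + ei for xi, ei in zip(x, eta)]
    diff = [a - b - c for a, b, c in zip(f(x_pert), f(x), f(eta))]
    return sum(d * d for d in diff) ** 0.5


rng = random.Random(0)
d_in, d_hid, d_out = 10, 32, 3  # hypothetical layer sizes
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hid)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]
eta = [0.1 * rng.gauss(0, 1) for _ in range(d_in)]

gap_linear = linearity_gap(lambda v: linear_model(W1, W2, v), x, eta)
gap_relu = linearity_gap(lambda v: relu_model(W1, W2, v), x, eta)
print(f"linear map gap:   {gap_linear:.2e}")
print(f"ReLU network gap: {gap_relu:.2e}")
```

The gap vanishes (up to rounding) for the linear map and is clearly non-zero for the ReLU network, which is the kind of evidence used to argue against global linearity.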

The local linearity hypothesis implies that the decision boundaries around an
instance can be approximated by a linear boundary. The geometric notion for
characterizing the linearity of a surface is its curvature, as decision boundaries for binary
classifiers are hypersurfaces in a higher-dimensional space. The differential geometric
notions of curvature are complex to characterize for DNNs. Therefore, in [19] the authors came
up with an alternative, yet related definition of curvature. Consider the decision
boundary for a binary classification problem⁴ with the classifier $f(\cdot)$ defined as


$$\mathcal{B} = \{x : f(x) = 0\},$$

and decision regions given by

$$\mathcal{R}_1 = \{x : f(x) > 0\} \quad\text{and}\quad \mathcal{R}_{-1} = \{x : f(x) < 0\}.$$

The curvature of the decision boundary $\mathcal{B}$ with respect to the $\ell_q$-norm is defined by

$$\kappa_q(\mathcal{B}) = \frac{1}{r_{\min}}, \quad\text{where}$$

$$r_{\min} = \inf_{x \in \mathcal{B}} \; \min_{i \in \{-1, 1\}} \; \sup_{x_0 \in \mathbb{R}^M} \left\{ \|x_0 - x\|_q : B_q(x_0, \|x_0 - x\|_q) \subset \mathcal{R}_i \right\}$$

and $B_q(x, \varepsilon)$ denotes the $\ell_q$-ball of radius $\varepsilon$ centered at $x$. In other words, $r_{\min}$ is
obtained by first finding, at each point $x$ on the decision boundary, the largest radius
of $\ell_q$-balls that touch $x$ while being contained in $\mathcal{R}_1$ and in $\mathcal{R}_{-1}$, respectively, and keeping
the smaller of the two. The radius is infinity for all $q \ge 1$ when the decision boundary is flat,
which means that the local curvature is equal to zero at this point. The minimum of such radii
over all $x$, that is $r_{\min}$, points at the most curved portion of the surface. The global curvature
of the surface $\mathcal{B}$ is the inverse of $r_{\min}$. A linear classifier yields decision boundaries with
zero curvature, and a small curvature surface might be completely flat in most of its points. It turns
out that the $\ell_q$-curvature $\kappa_q(\mathcal{B})$ determines the robustness against $\ell_p$-attacks with
$(p, q)$ as a dual pair.


⁴ In this section, we focus mainly on binary classification examples assuming that the results can be
extended without particular difficulty to multi-class classification problems.
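As a worked illustration of this definition (an example of our own choosing that does not appear in [19]), take $q = 2$ and the classifier $f(x) = \|x\|_2 - R$ in $\mathbb{R}^M$, whose decision boundary is the sphere of radius $R$:

```latex
\[
  \mathcal{B} = \{x : \|x\|_2 = R\}, \qquad
  \mathcal{R}_{-1} = \{x : \|x\|_2 < R\}, \qquad
  \mathcal{R}_{1} = \{x : \|x\|_2 > R\}.
\]
% Inside: the largest ball that touches a boundary point x and stays in R_{-1}
% is centered at the origin, so the supremum equals R.
% Outside: the ball centered at x_0 = (1 + t/R) x with radius t touches x and
% stays in R_1 for every t > 0, so the supremum is infinite.
\[
  r_{\min} = \inf_{x \in \mathcal{B}} \min\{R, \infty\} = R,
  \qquad
  \kappa_2(\mathcal{B}) = \frac{1}{R}.
\]
```

As $R \to \infty$ the curvature vanishes, which recovers the flat boundary of a linear classifier.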

Based on this notion of curvature, and for $\ell_2$-attacks, the authors in [19]
compare the robustness of classifiers to random noise and adversarial noise and
characterize it according to the curvature of decision boundaries.⁵ The random noise
is modeled as a random direction and the robustness against random noise at $x$,
denoted by $\rho_r(x)$, is defined by the minimum $\ell_2$-norm of a random vector required
to change the label of $x$. The adversarial robustness, denoted by $\rho(x)$, is the minimum
$\ell_2$-norm of a perturbation particularly designed to change the label.


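Theorem 4 ([19, Theorem 2]) Suppose that for a binary classifier the curvature
$\kappa_2(\mathcal{B})$ satisfies

$$\kappa_2(\mathcal{B}) \le \frac{0.2}{\zeta_2(\delta)\, M\, \rho(x)},$$

then with probability at least $1 - 4\delta$ it holds

$$\left(1 - 0.625\, M \rho(x)\, \kappa_2(\mathcal{B})\, \zeta_2(\delta)\right) \sqrt{M \zeta_1(\delta)}
\;\le\; \frac{\rho_r(x)}{\rho(x)} \;\le\;
\left(1 + 2.25\, M \rho(x)\, \kappa_2(\mathcal{B})\, \zeta_2(\delta)\right) \sqrt{M \zeta_2(\delta)},$$

where

$$\zeta_1(\delta) = \left(1 + 2\sqrt{\ln(1/\delta)} + 2\ln(1/\delta)\right)^{-1}, \qquad
\zeta_2(\delta) = \left(\max\left((1/e)\,\delta^{2},\; 1 - \sqrt{2(1 - \delta^{2})}\right)\right)^{-1}.$$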


The theorem implies that if the curvature of the decision boundary is small enough,
the robustness to random noise exceeds the adversarial robustness by a factor of order $\sqrt{M}$;
equivalently, the adversarial robustness scales as $1/\sqrt{M}$ of the robustness to random noise.
In other words, in higher dimensions, classifiers with flat boundaries can be robust
to random noise even if they are not robust to adversarial examples. Note that the
curvature of non-smooth boundaries can be huge, rendering the above theorem
non-informative. The curvature for multi-class classification, however, is characterized
by the curvature of pairwise boundaries, which does not include high curvature junctions.
The intersection of these boundaries might have high curvature, but this issue
does not matter in the above theorem, where the pairwise boundaries are considered.
Although stated only for $\ell_2$-attacks, this result is extended to $\ell_p$-attacks in [21], where the
small curvature condition is replaced by a condition called locally approximately flat
decision boundaries.


⁵ They consider semi-random noise as well; however, we restrict ourselves to simple random noise.

The geometric properties of decision boundaries are further investigated in [46]
for universal adversarial perturbations. Universal adversarial perturbations exist for
both flat and curved decision boundaries. Essential to the existence of universal
adversarial perturbations are shared directions along which the decision boundary is
positively curved. The above results compare random noise and adversarial perturbations and
rely on low curvature assumptions for decision boundaries. The conclusion can
be put with a flavor of blessing of dimensionality: locally flat classifiers
are more robust to random noise, particularly in higher dimensions.
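The $\sqrt{M}$ gap between random and adversarial noise can be observed directly for a perfectly flat boundary. The sketch below assumes a linear boundary $\{z : w^T z = 0\}$ (zero curvature) and a single random direction per trial; the dimension, the number of trials and the helper name are illustrative choices, not part of [19].

```python
import math
import random
import statistics


def robustness_ratio(M, rng):
    """Ratio rho_r / rho for the flat boundary {z : w^T z = 0}.

    Adversarial robustness: rho = |w^T x| / ||w||_2 (step along w).
    Random-noise robustness: rho_r = |w^T x| / |w^T v| for a unit direction v.
    The ratio ||w||_2 / |w^T v| does not depend on the point x.
    """
    w = [rng.gauss(0, 1) for _ in range(M)]
    v = [rng.gauss(0, 1) for _ in range(M)]
    v_norm = math.sqrt(sum(vi * vi for vi in v))
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    w_dot_v = sum(wi * vi for wi, vi in zip(w, v)) / v_norm
    return w_norm / abs(w_dot_v)


rng = random.Random(0)
M, trials = 400, 200  # illustrative dimension and number of random directions
ratios = [robustness_ratio(M, rng) for _ in range(trials)]
median_ratio = statistics.median(ratios)
print(f"sqrt(M) = {math.sqrt(M):.1f}, median rho_r / rho = {median_ratio:.1f}")
```

The ratio is never below one, and its typical value is of the order of $\sqrt{M}$, in line with the scaling in Theorem 4.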

The local linearity assumption is further refined in [16]. The flatness of decision 
boundaries can be violated in some directions, and nevertheless, the adversarial 
vulnerability persists. The authors show that adversarial examples can exist if
the boundaries are flat along most directions and highly curved only along
a few directions. This claim is additionally supported by [20, 55], where the curvature
profile of deep networks is numerically characterized and shown to be highly
sparse, which implies that the boundaries are not flat overall but only
along most directions.

It is worth finishing the discussion of the linearity hypothesis by referring
to a recent result that casts some doubt on the role
of linearity in adversarial robustness. Contrary to all the above claims, the work [46]
shows that adversarial training, a powerful and consistent defense against adversarial
attacks, leads to a significant decrease in the curvature of loss functions. The
connection runs in both directions, as training DNNs with curvature regularization
tends to improve adversarial robustness. As long as the curvature of loss functions
affects the curvature of decision boundaries, the result stands out as a strong
counter-argument to the linearity hypothesis and opens new challenges for it.


3.2 Boundary Tilting and Other Explanations 


The linearity hypothesis is not the only available theory. Other theories attribute
adversarial robustness to other features of classifiers. A simple intuition already emerged
from our discussion of linear classifiers. The $\ell_\infty$-attacks for linear classifiers with
unit-norm parameters $w$ generated an output perturbation equal to $-\varepsilon\|w\|_1$. Therefore, among
all unit-norm $w$'s, the most robust classifiers are those with the smallest $\ell_1$-norm, which
are also the sparsest possible vectors.

For binary classification problems, the authors in [25] showed theoretically that
the adversarial robustness decreases when the $\ell_1$-norm of $w$ increases. We introduce
some definitions before stating the theorem. Suppose that the instances and labels
$(x, y)$ follow a distribution $P_{x,y}$. The adversarial robustness for this probabilistic
model is defined as the probability that the perturbed instance is still classified correctly,

$$\rho_\infty = P_{x,y}\left[y = \mathrm{sign}(w^T x_{\mathrm{adv}})\right],$$

where $x_{\mathrm{adv}}$ is the $\ell_\infty$-perturbed instance, given by $x_{\mathrm{adv}} = x - \varepsilon y\,\mathrm{sign}(w)$. Let us
define $\mu_k$ for $k \in \{1, -1\}$ as follows

$$\mu_k = E(x \mid y = k,\ \mathrm{sign}(w^T x) = k).$$


We can now state the theorem. 


Theorem 5 ([25, Theorem 3.1]) For a binary classification problem with uniformly
distributed labels, if the accuracy of a classifier is given by $t$ then the adversarial
robustness $\rho_\infty$ against $\varepsilon$-bounded $\ell_\infty$-attacks satisfies

$$\rho_\infty \le \frac{t\, w^T(\mu_{+1} - \mu_{-1})}{2\varepsilon\|w\|_1}.$$


The denominator of the bound on the right hand side contains the $\ell_1$-norm of $w$.
Therefore, as the authors in [25] maintain, among those linear classifiers with a similar
discriminatory capability, those with the smallest $\ell_1$-norm perform better under
$\ell_\infty$-attacks. A small $\ell_1$-norm permits a larger $\rho_\infty$, which means better robustness.

The theorem, however, provides only an upper bound and, although it can
characterize the negative effect of a large $\ell_1$-norm on robustness, it cannot
guarantee that a small $\ell_1$-norm promotes robustness, nor that small $\ell_1$-norms
necessarily lead to sparse $w$. The claim, however, seems to hold, as experimental findings
support the idea that the sparsity of weights promotes adversarial robustness.
Besides, since the difference $\mu_{+1} - \mu_{-1}$ is independent of the norm of $w$, the inner
product $w^T(\mu_{+1} - \mu_{-1})$ scales with the norm of $w$. In this light, another reading
of this theorem suggests that among all unit $\ell_1$-norm $w$'s, the one with the largest
$w^T(\mu_{+1} - \mu_{-1})$ restricts robustness the least. To summarize the theorem, a first step
toward robustness of linear classifiers is to find the smallest $\ell_1$-norm $w$ for which the
inner product $w^T(\mu_{+1} - \mu_{-1})$ is high enough.
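The bound of Theorem 5 is a Markov-type inequality and can be checked on synthetic data. The sketch below is an illustration under our own assumptions: labels are uniform, instances are drawn as $x = y\,(1, \dots, 1) + $ standard Gaussian noise, and the weight vector is hypothetical. The quantity `bound` is the empirical counterpart of $t\, w^T(\mu_{+1} - \mu_{-1}) / (2\varepsilon\|w\|_1)$, since $E[y\, w^T x\, \mathbf{1}\{\text{correct}\}] = \tfrac{t}{2} w^T(\mu_{+1} - \mu_{-1})$.

```python
import random


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def simulate(w, eps, n, rng):
    """Empirical robustness rho_inf and the upper bound of Theorem 5."""
    d = len(w)
    w_l1 = sum(abs(wi) for wi in w)
    robust_hits = 0      # samples with y = sign(w^T x_adv)
    margin_sum = 0.0     # sum of y * w^T x over correctly classified samples
    for _ in range(n):
        y = rng.choice((-1, 1))
        x = [y * 1.0 + rng.gauss(0, 1) for _ in range(d)]
        score = sum(wi * xi for wi, xi in zip(w, x))
        # x_adv = x - eps * y * sign(w) lowers y * w^T x by eps * ||w||_1.
        adv_score = score - eps * y * w_l1
        if sign(adv_score) == y:
            robust_hits += 1
        if sign(score) == y:
            margin_sum += y * score
    rho_inf = robust_hits / n
    bound = (margin_sum / n) / (eps * w_l1)
    return rho_inf, bound


rng = random.Random(0)
w = [0.5, 0.5, 0.5, 0.5]  # hypothetical weights with ||w||_1 = 2
results = []
for eps in (1.0, 2.0, 4.0):
    rho_inf, bound = simulate(w, eps, n=5000, rng=rng)
    results.append((eps, rho_inf, bound))
    print(f"eps = {eps:.1f}: rho_inf = {rho_inf:.3f} <= bound = {bound:.3f}")
```

The bound always holds but can be loose, which matches the remark above that the theorem constrains robustness from above without guaranteeing it.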

Another explanation of adversarial examples, introduced in [71], starts from the 
assumption that the data lies on a low-dimensional manifold in higher dimensional 
space, and many classifiers exist with similar accuracy. This is shown in Fig. 5 using a 
simple example of linear manifolds and linear classifiers. We assume that the data lies 
on a linear subspace and the dashed line represents the boundary of an optimal Bayes 
linear classifier for the data distribution with zero error. However, the rotated versions 
of this linear classifier, for example the one with the solid line as the boundary, yield 
the same accuracy. The main difference between these classifiers is their robustness 
to adversarial examples. If the linear boundary is tilted so that it lies close to the data
subspace, a smaller $\ell_\infty$-norm perturbation can fool the classifier. This can be seen in
Fig. 5, as the $\ell_\infty$-ball touching the tilted boundary is smaller than the one touching the
original, not-tilted boundary. This is known as the boundary tilting hypothesis. Under this
hypothesis, the adversarial vulnerability of classifiers arises from a tilted classification boundary
close to the data manifold. A further exploration of the linear classifier example can
reveal that some of the tilted boundaries can indeed improve the robustness.

(a) Not-tilted boundary (b) Tilted boundary


Fig. 5 Adversarial robustness of tilted boundaries: (a) the dashed line is the ground truth linear
classifier for the data, which is supported on a linear subspace, (b) the solid line, a tilted boundary,
yields the same risk as the ground truth but is fooled with smaller $\ell_\infty$-perturbations
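The effect of tilting can be reproduced with a two-dimensional toy problem. The sketch below assumes data on the subspace $\{(a, 0)\}$ with label $\mathrm{sign}(a)$ and boundaries $\{z : w^T z = 0\}$ with $w = (\cos\theta, \sin\theta)$; the sample points and tilt angles are hypothetical. It uses the fact that the $\ell_\infty$-distance from $x$ to the hyperplane is $|w^T x| / \|w\|_1$.

```python
import math


def linf_distance_to_boundary(x, w):
    """l_inf distance from x to {z : w^T z = 0}: |w^T x| / ||w||_1 (dual norm)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return abs(score) / sum(abs(wi) for wi in w)


def tilted_classifier(theta_deg):
    """Unit normal vector of a boundary tilted by theta degrees in the plane."""
    theta = math.radians(theta_deg)
    return (math.cos(theta), math.sin(theta))


# Data on the subspace {(a, 0)} with label sign(a); hypothetical sample points.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
labels = [-1, -1, 1, 1]

accuracies, distances = {}, {}
for theta_deg in (0, 45, 80):
    w = tilted_classifier(theta_deg)
    preds = [1 if w[0] * a + w[1] * b > 0 else -1 for a, b in data]
    accuracies[theta_deg] = sum(p == y for p, y in zip(preds, labels)) / len(data)
    distances[theta_deg] = min(linf_distance_to_boundary(x, w) for x in data)
    print(f"tilt {theta_deg:2d} deg: accuracy = {accuracies[theta_deg]:.2f}, "
          f"smallest l_inf distance = {distances[theta_deg]:.3f}")
```

All three boundaries classify the data perfectly, yet the smallest $\ell_\infty$-perturbation that reaches the boundary shrinks as the boundary is tilted towards the data subspace.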


3.3 Feature Selection and No Free Lunch Theorems 
for Adversarial Robustness 


Many classifier functions can be clearly decomposed into feature extraction and 
classification parts. In [74], adversarial robustness is shown to be affected by the 
feature selection part of the model. The results of [74] rely on the assumption that 
there is an oracle classifier function g(x) that generates the ground truth labels. 
For image classification problems, it is simply the human eye. Classifiers $f(\cdot)$, and in
particular $g(\cdot)$, are decomposed into a feature extraction part $e_f(\cdot)$ and a classifier
part $c_f(\cdot)$. The feature space of a classifier is the image of the domain set $\mathcal{X}$ under
the feature extraction $e_f(\cdot)$. Feature spaces are assumed to be metric spaces. Denote
the oracle feature space by $(\mathcal{X}_g, d_g)$, where $d_g$ is the respective metric, and $(\mathcal{X}_f, d_f)$
similarly for a classifier $f(\cdot)$.

Adversarial perturbations change neither the oracle decision, i.e., $g(x) = g(x + \eta)$,
nor, significantly, the oracle features:

$$d_g(e_g(x), e_g(x + \eta)) < \varepsilon.$$

However, the classifier is fooled ($f(x) \neq f(x + \eta)$). A classifier is called $(\varepsilon, \delta)$-robust
if for all $x, y \in \mathcal{X}$ for which $g(x) = g(y)$ and $d_g(e_g(x), e_g(y)) < \varepsilon$, it holds with
probability at least $1 - \delta$ that $f(x) = f(y)$.


Theorem 6 ([74, Theorem 3.2–3.4]) Let a classifier $f(\cdot)$ be continuous almost
everywhere and $g(\cdot)$ be the oracle classifier. The classifier $f(\cdot)$ is $(\varepsilon, \delta)$-robust to
adversarial examples if and only if the topology of the feature space $(\mathcal{X}_f, d_f)$ is finer
than the topology of the oracle feature space $(\mathcal{X}_g, d_g)$.


As a direct corollary of the above theorem, when the two feature spaces are
Euclidean spaces of dimension $n_g$ and $n_f$, the classifier $f(\cdot)$ is robust if and
only if $n_f \le n_g$. The above theorem implies first that the selection of features and
feature spaces is crucial for adversarial robustness. Although the assumption of an
oracle function and a unique suitable feature space can be contested, the theorem
applies to any two classifiers and states that if a perturbation does not fool $g(\cdot)$
then it does not fool $f(\cdot)$. So among the classifiers, those with feature spaces of
finer topology, or lower dimension in Euclidean spaces, are favored for adversarial
robustness.

The importance of selecting proper features is addressed in other works such as 
[17, 18]. A toy example is used in [17, 18] to show that linear classifiers are unable
to use the more robust features of an image for adversarially robust classification, unlike
quadratic classifiers, which are more robust in that example. For $\ell_2$-attacks, the authors
point out that the adversarial robustness is directly related to the so-called distin- 
guishability measure of classes and the risk of the classifier. The distinguishability 
measure can be seen as low flexibility of classifiers in general compared to the dif- 
ficulty of the classification task. We state a simplified version of their theorem for 
linear classifiers using the definition of p(x) from Theorem 4. 


Theorem 7 ([17, Theorem 4.1]) For a binary classification task with uniformly
distributed labels and $\|x\|_2 \le B$ a.e., the adversarial robustness of a linear classifier
$\mathrm{sign}(w^T x)$ with accuracy $t$ satisfies

$$E(\rho(x)) \le \frac{1}{2}\left\|E_{P_{x|y=1}}(x) - E_{P_{x|y=-1}}(x)\right\|_2 + 2B(1 - t).$$


The distinguishability measure $\|E_{P_{x|y=1}}(x) - E_{P_{x|y=-1}}(x)\|_2$ is a feature of the
classification problem and does not depend on the classifier. However, an unexpected
conclusion of the theorem is that if the classification task is difficult, in that the
distinguishability measure is small, the risk of the classifier becomes dominant in
the upper bound and inversely related with the robustness. Therefore, low risk
classifiers have less adversarial robustness for difficult classification tasks.

The inverse connection of risk and robustness is further explored in [73] through a
binary classification example. An instance of data is given by $x = (x_1, \dots, x_M, x_{M+1})$
and it is related to its label $y$ randomly as follows. The first entry is a Bernoulli
random variable with $P(x_1 = y) = p$ and the other entries, $x_i$, are normally
distributed random variables with mean value $\epsilon y$ and unit variance (the mean parameter
$\epsilon$ should not be confused with the perturbation budget $\varepsilon$). A linear classifier
with $w = (0, 1/M, \dots, 1/M)$ can be shown to achieve more than 0.99 accuracy if
$\epsilon$ is of order $1/\sqrt{M}$. However, this classifier can achieve an adversarial accuracy of at most
0.01 under the $\ell_\infty$-attack with budget $\varepsilon = 2\epsilon$. If, in contrast, one uses only the first feature
$x_1$ for the classification, both standard and adversarial accuracies are 0.7 (for $p = 0.7$). The data
consists of, on the one hand, robust features with less accuracy and, on the other
hand, informative and non-robust features. We might ask whether this tension can be
circumvented using a smart combination of features so that adversarial robustness
does not come at the price of accuracy. The authors of [73] answer negatively
by stating a no-free-lunch theorem for adversarial robustness.


Theorem 8 ([73, Theorem 2.1]) Any classifier with standard accuracy at least $1 - \delta$
on the above problem cannot achieve adversarial accuracy more than $\frac{p}{1-p}\,\delta$ against
$\ell_\infty$-bounded perturbations with $\|\eta\|_\infty \ge 2\epsilon$.
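The toy problem above is easy to simulate. The following sketch uses illustrative values that are our own assumptions ($M = 100$, $p = 0.7$, mean parameter $3/\sqrt{M}$, 2000 samples) and compares the averaging classifier with the classifier that looks only at the first feature, under the worst-case $\ell_\infty$-perturbation of budget twice the mean parameter.

```python
import math
import random


def sample(M, mean, p, rng):
    """Draw (x1, rest, y): x1 = y w.p. p (else -y); rest ~ N(mean * y, 1)."""
    y = rng.choice((-1, 1))
    x1 = y if rng.random() < p else -y
    rest = [rng.gauss(mean * y, 1.0) for _ in range(M)]
    return x1, rest, y


def predict_avg(rest, shift=0.0):
    """Classifier w = (0, 1/M, ..., 1/M); `shift` is added to every Gaussian feature."""
    return 1 if sum(r + shift for r in rest) / len(rest) > 0 else -1


def predict_first(x1, shift=0.0):
    """Classifier that uses only the first (robust) feature."""
    return 1 if x1 + shift > 0 else -1


rng = random.Random(0)
M, p, n = 100, 0.7, 2000          # illustrative values
mean = 3.0 / math.sqrt(M)         # of order 1/sqrt(M); the constant 3 is our choice
budget = 2.0 * mean               # l_inf budget of the adversary

std_avg = adv_avg = std_first = adv_first = 0
for _ in range(n):
    x1, rest, y = sample(M, mean, p, rng)
    # Worst-case l_inf perturbation for both rules: move every used
    # coordinate by `budget` against the label.
    std_avg += predict_avg(rest) == y
    adv_avg += predict_avg(rest, shift=-budget * y) == y
    std_first += predict_first(x1) == y
    adv_first += predict_first(x1, shift=-budget * y) == y

std_avg, adv_avg, std_first, adv_first = (
    c / n for c in (std_avg, adv_avg, std_first, adv_first)
)
print(f"averaging classifier: standard {std_avg:.3f}, adversarial {adv_avg:.3f}")
print(f"first feature only:   standard {std_first:.3f}, adversarial {adv_first:.3f}")
```

The averaging classifier is nearly perfect on clean data and nearly always wrong under attack, while the single robust feature keeps the same accuracy of about $p$ in both settings.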


The importance of feature selection for adversarial robustness is highlighted 
already in [73]. A similar result is obtained in [15] for a class of data distributions 
satisfying W-Talagrand transportation-cost inequality. The condition is intuitively 
related to the curvature of decision regions and matches the previously mentioned 
intuition that low curvature decision boundaries entail adversarial vulnerability. From 
another perspective, $\varepsilon$-bounded $\ell_p$-adversarial attacks manage to alter the label of
instances that lie within $\ell_p$-distance $\varepsilon$ of the boundary of decision regions. The adversarial
problem so formulated can naturally be cast as the study of the blowing-up property of
decision regions, a problem well studied by concentration results and isoperimetric
inequalities. A similar approach is followed in [14, 22, 44] for instances following
a Gaussian distribution and the uniform distribution over hypercubes.


3.4 Generalization Bounds for Adversarial Examples 


The no-free-lunch theorem states that the adversarial robustness of machine learning
algorithms does not align in general with their risk. Adversarial training of DNNs,
indeed, confirms the same point that adversarial robustness is obtained at the cost of
degraded generalization. According to an example in [43], the adversarial accuracy
of 96% for a ResNet model trained on CIFAR-10 comes with a test accuracy of 47%.
In [67], the authors attribute this trade-off to the definition of adversarial robustness. 
They propose another definition of adversarial robustness for which no trade-off is 
observed between adversarial robustness and accuracy. 

From these indications, therefore, questions emerge regarding the generalization
properties of adversarially robust algorithms. We summarize some of the progress
in that direction that uses statistical learning theory as the framework for studying
generalization properties of learning algorithms. An interested reader can refer to
the excellent references [2, 64] for an introduction to fundamental notions of statistical
learning theory.

Statistical learning theory explains generalization properties of many classical 
learning algorithms using notions like VC-dimension, Rademacher complexity and 
uniform convergence. A similar approach, however, cannot be directly applied to 
understanding generalization for neural networks. The large number of parameters
in DNNs renders many of these bounds vacuous, as these models are
capable of fitting an arbitrarily large number of random instances with random labeling
[79]. A solution is to use a suitable normalization. It has been shown that
margin based normalization can be used to obtain tight generalization bounds by
Rademacher complexity or PAC-Bayesian methods that match experimental results
[3, 10, 23, 49].

We already discussed some works relating adversarial robustness and accuracy,
all of them derived for a class of data distributions. In this section, we focus on
sample complexity bounds for adversarially robust generalization and see whether
the available bounds attest to the difficulty of training adversarially robust and yet
accurate models. A first indication in this direction can be traced to [63], where it is
shown that for a Gaussian model of data, the sample complexity of robust learning for
$M$-dimensional data is of order $\sqrt{M}$ times larger than that of standard learning, which
makes the former more difficult. The gap is information theoretic. PAC-learning for the
adversarial setting is an open problem, although some bounds exist for binary linear classifiers
obtained in terms of VC-dimension [13]. This result, however, shows no
negative effect of robust training on generalization, counter to the above intuition. A
generalization bound is also obtained in [6] when the set of adversarial perturbations
is finite. The sample complexity of binary classification depends on $k \log(k)\, \mathrm{VC}(\mathcal{H})$,
where $\mathrm{VC}(\mathcal{H})$ is the VC-dimension of the hypothesis class $\mathcal{H}$ and $k$ is the number of
different adversarial perturbations. The result is not directly applicable to standard
adversarial attacks, where the set of possible perturbations is not finite; however, it
points to the larger sample complexity of robust learning.

There is a difficulty with VC-dimension bounds when they are applied to DNNs.
As explained in [77], VC-dimension bounds usually depend on the number
of model parameters, which is very large for DNNs. The corresponding sample
complexity bound becomes unreasonably large, beyond typically available datasets.
Rademacher complexity bounds, on the other hand, depend mostly on inherently
smaller quantities like the norm of weight matrices and, therefore, are more
appropriate for establishing generalization properties of DNNs.

Rademacher complexity bounds for adversarial robustness are obtained in [31, 77].
They use different techniques for deriving their bounds and have different scopes of
applicability. Nevertheless, both works contain Rademacher complexity bounds for 
binary and multi-class classification and are applicable to neural networks. Both 
works use surrogate adversarial loss. In particular, the authors in [77] build on 
semidefinite programming (SDP) relaxation techniques of [56]. We do not expand 
on the technical results and, instead, state qualitatively some implications of these 
results. 

As briefly discussed above, Rademacher complexity bounds depend mostly
on the norm of weight matrices. However, the lower bound on the Rademacher complexity
of neural networks for robust training in [77] has an additional dependence on the
dimension for $\ell_\infty$-attacks. The dependence disappears only if the weight matrix
of the first layer has bounded $\ell_1$-norm. In the absence of this assumption, this bound
confirms the hypothesis that robust training is more difficult than standard training.
The technique employed in [31], however, yields upper bounds in which the
effect of adversarial perturbations appears as an additive term in the generalization
bound. The authors, therefore, conclude that it should not be impossible to obtain
both high adversarial robustness and high accuracy. Although a final verdict seems
to be out of reach at the moment, new regularization techniques arise from these
generalization bounds that can be used during training for robust learning with high
accuracy.


4 Defenses Against Adversarial Attacks


There exist several types of defenses against adversarial examples, as well as subsequent
methods for bypassing them. It is difficult to point out, at the time of writing
this chapter, a consensus on an effective defense against adversarial examples, with
the possible exception of adversarial training. For instance, the authors in [11]
proposed three attacks to bypass defensive distillation against adversarial perturbations
[54]. Moreover, the attacks from [5] bypassed 7 out of 9 non-certified defenses of
ICLR 2018 that claimed to be white-box secure. Adversarial training, however, adds
adversarial examples to the training set and is the most commonly accepted defense
against adversarial attacks. In what follows, we discuss some difficulties of adversarial
training as well as those methods that try to promote robustness merely through
regularization techniques.


4.1 Obfuscated Gradients and Adversarial Training 


The rise of adversarial perturbations in computer vision has motivated further 
research on defending against such perturbations. To that end several defenses against 
adversarial examples, such as [38, 60, 65], have been designed. Since many adver- 
sarial attacks make use of the classifier’s gradient with respect to some objective 
function, as in Algorithm 1, initial works on adversarial defenses rely on distorting 
and hiding the information about that gradient. In [5], these techniques were said
to obfuscate the gradient. More precisely, the gradient may be obfuscated in
any of the following ways.


— Shattered Gradients appear when the defense mechanism is not differentiable, 
numerically unstable, or intentionally has misleading gradients. 

— Stochastic Gradients occur when the defense method is based on introducing 
randomness into the prediction. Such randomness is added to prevent the attacker 
from estimating the gradients. 

— Exploding Gradients happen when the defense algorithm consists of recursive
evaluations of the DNN function. In other words, the output of one DNN evaluation
is the input of the next one. This type of computation implicitly transforms the
original DNN into an extremely deep neural network, which may lead gradients to
explode (or vanish) during inference.


Despite the apparent success of this type of defense, in [5] it is shown that such
mechanisms give a false sense of security. The main reason behind this phenomenon
is that obfuscating the gradients of a model does not necessarily increase its robustness
against adversarial perturbations; instead, it prevents specific algorithms from finding them.
In other words, for such models adversarial perturbations still exist; they are just
harder to find using certain methods. Therefore, although a model with obfuscated
gradients can be robust against a specific adversarial attack, it may still be vulnerable
to others. Using this idea, the authors in [5] provide the following conditions to
identify models that exhibit this problem.


— One-step attacks perform better than iterative attacks. 

— Black-box attacks perform better than white-box attacks. 

— Attacks with large « do not reach 100% fooling ratio. 

— Random sampling finds adversarial examples, while adversarial attacks 
don’t. 


If a model satisfies one of the above conditions, it suffers from the obfuscated gra- 
dient problem. Using these guidelines, the authors identified that 7 out of 9 defenses 
accepted to ICLR 2018, that were deemed to be white-box secure, suffered from this 
issue. In addition, the authors fooled those defenses using customized attacks. 
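A toy example makes the false sense of security concrete. In the sketch below, a hypothetical linear classifier is "defended" by quantizing its input, which shatters the gradient: a gradient-sign attack on the defended model does nothing, while an attack that simply ignores the quantizer and uses the gradient of the undefended score still flips the label. The classifier, the quantization defense and the surrogate attack are illustrative constructions of our own and are not the defenses or attacks evaluated in [5].

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


W = [1.0, -1.0, 1.0, -1.0]   # hypothetical linear classifier


def quantize(x, levels=4):
    """Toy defense: round each input coordinate to a grid of step 1/levels."""
    return [round(xi * levels) / levels for xi in x]


def score(x):
    """Defended classifier score w^T quantize(x)."""
    return sum(wi * qi for wi, qi in zip(W, quantize(x)))


def numerical_gradient(f, x, h=1e-3):
    """Central finite differences of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


x, y, eps = [0.65, 0.35, 0.65, 0.35], 1, 0.3

# Gradient-sign attack on the defended model: the gradient is zero almost
# everywhere, so the step does not move the input.
grad = numerical_gradient(score, x)
x_grad_attack = [xi - eps * y * sign(gi) for xi, gi in zip(x, grad)]

# Attack that ignores the quantizer and uses the gradient of w^T x instead.
x_surrogate_attack = [xi - eps * y * sign(wi) for xi, wi in zip(x, W)]

print("gradient of defended score:", grad)
print("label after gradient attack: ", sign(score(x_grad_attack)))
print("label after surrogate attack:", sign(score(x_surrogate_attack)))
```

The adversarial example exists within the same $\ell_\infty$-budget in both cases; only the attack that relies on the obfuscated gradient fails to find it.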

So far, the most successful defenses against adversarial attacks consist of adding 
adversarial examples to the training set. This is known as adversarial training. Ini- 
tial attempts to perform adversarial training using the FGSM proved to suffer from 
obfuscated gradients. This occurs since a DNN trained solely with the FGSM learns 
to shatter the gradients that are in a close vicinity to the data samples, such that 
the gradients used for the FGSM point in misleading directions. While this
may mislead the FGSM, the model is still vulnerable to other perturbations, for instance
black-box attacks. To overcome this issue, the dithering mechanism proposed for
the PGD attack is employed in [43], along with large $\varepsilon$ values⁶ during adversarial
training. This approach provided diverse sets of random adversarial examples, which
prevented DNNs from obtaining low fooling ratios by shattering the gradients around
the data samples. In other words, the randomness of the starting point in the PGD
attack prevents the model from overfitting to the perturbations.
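The training loop described above can be sketched as follows. This is a minimal illustration under our own assumptions: a logistic-regression model stands in for a DNN, the data are two synthetic Gaussian blobs, and the step size, number of PGD steps and learning rate are hypothetical rather than the settings of [43]. The random start inside the $\varepsilon$-ball plays the role of the dithering mechanism.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_x(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y w^T x)) with respect to x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [-y * sigmoid(-margin) * wi for wi in w]


def pgd_attack(w, x, y, eps, step, n_steps, rng):
    """l_inf PGD with a random start inside the eps-ball around x."""
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(n_steps):
        g = loss_grad_x(w, x_adv, y)
        x_adv = [xa + step * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
        # Project back onto the l_inf ball of radius eps around x.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


def adversarial_training(data, eps, lr=0.1, epochs=20, seed=0):
    """SGD on the logistic loss evaluated at PGD examples instead of clean ones."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd_attack(w, x, y, eps, step=eps / 4, n_steps=10, rng=rng)
            margin = y * sum(wi * xi for wi, xi in zip(w, x_adv))
            coeff = -y * sigmoid(-margin)          # d loss / d (w^T x_adv)
            w = [wi - lr * coeff * xa for wi, xa in zip(w, x_adv)]
    return w


rng = random.Random(1)
data = [([y * 2.0 + rng.gauss(0, 1), rng.gauss(0, 1)], y)
        for y in (rng.choice((-1, 1)) for _ in range(200))]
eps = 0.5
w = adversarial_training(data, eps)

x0, y0 = data[0]
x0_adv = pgd_attack(w, x0, y0, eps, step=eps / 4, n_steps=10, rng=rng)
linf = max(abs(a - b) for a, b in zip(x0_adv, x0))
print("trained weights:", w)
print("l_inf norm of a PGD perturbation:", linf)
```

For a DNN the structure is the same: the inner loop searches for a loss-maximizing perturbation from a random starting point, and the outer loop updates the weights on the perturbed samples.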

From these initial findings it is concluded that diversity in the adversarial examples
used for training is necessary in order to prevent DNNs from overfitting to specific types
of perturbations. To that end, in [72] the authors include black-box perturbations in
the training set. This is carried out using substitute models, as well as ensembles 
of these models, in the objective function of white-box attacks (such as PGD) as 
described in Sect. 2.2. 


4.2 Robust Regularization 


Despite the success of adversarial training on promoting robustness, these meth- 
ods either suffer from obfuscated gradients (e.g., when the FGSM is employed) or 
are deemed to be computationally expensive, since iterative methods require sev- 
eral evaluations of the DNN function to compute a single adversarial example. In 
[43] it is observed that adversarial training induces sparsity on the weights of the 
first convolutional filters of CNNs trained with the MNIST dataset. Similarly, the 
authors in [36] observe that adversarial training induces low-rank structures in the 


⁶In that work, the $\ell_\infty$-constraint $\|\eta\|_\infty \leq \epsilon = 0.3$ is employed to train models where the input
values were between 0 and 1.
Fig. 6 Reshaped input weight matrix $W^{1}$ of a DNN, from [36], after natural training
as well as adversarial training with $\epsilon = 0.05$. A simultaneously low-rank and sparse structure is
observed in the weights after adversarial training


weight matrices of the DNN, as well as sparsity. As an example, the authors provide a
visualization of the weights in the first layer of a DNN, shown in Fig. 6, where the
simultaneously low-rank and sparse structure of such weights is clearly visible. This
is additionally confirmed by looking at the mutual information between the input
and the layers. From an information-theoretic viewpoint, increasing adversarial
robustness coincides with decreasing mutual information, which indicates more
compression of the input in the hidden layers. These results motivate research towards
finding the key properties that lead to robustness of DNNs. The idea is to propose a metric
for robustness and promote it during training. A common technique for promoting
specific properties during training is to add a penalty term to the loss function, known
as a regularization term, that penalizes undesired properties of the classifier function.
Here are some examples of robust regularization.


— Sparsity: In [25, 75] the authors argue that sparsity of the weight matrices of a
DNN promotes robustness against adversarial examples. They propose to add a
regularization term with the sum of the $\ell_1$-norms of the weight matrices involved,
which is known to promote approximately sparse solutions. In addition, the authors
make use of pruning⁷ to impose arbitrary sparsity levels.

— Low-Rankness: In [36, 61] it is observed that adversarial training induces low-rank
structures on the weight matrices of DNNs. Motivated by this phenomenon,
low-rank regularization techniques are proposed. In [61] the authors explicitly


⁷Pruning consists of setting to zero the smallest weights (in absolute value) of a given weight matrix,
thus enforcing a certain level of sparsity. The number of weights to be set to zero is chosen
arbitrarily. Usually pruning requires an extra phase of retraining (fine-tuning of the remaining
non-zero weights) to compensate for the performance degradation caused by the initial manipulation of
the weights.
constrain the rank of weight matrices in the optimization algorithm used for training.
On the other hand, in [36] the nuclear norm of the weight matrices is employed
as a regularization term in the training loss. The nuclear norm of a matrix can be
written as the $\ell_1$-norm of its vector of singular values, thus using it as a regularization
term promotes sparsity in that vector of singular values (i.e., low-rankness).

— Norm of the network’s Jacobian: In [50], the authors aim at minimizing the
$\ell_2$-norm of the output perturbation, that is $\|f(x) - f(x + \eta)\|_2$. Assuming
$\|\eta\|_2 \leq \epsilon$, upper-bounding a first-order approximation of this functional yields

$$\|f(x) - f(x + \eta)\|_2 \approx \|J_f(x)\,\eta\|_2 \leq \epsilon \|J_f(x)\|_F .$$

Motivated by this result, the authors propose using the Frobenius norm of the
Jacobian $\|J_f(x)\|_F$ as a regularization term to promote robustness. Scaled by $\epsilon$,
the Frobenius norm is an upper bound on the linearized $\ell_2$-norm of the output
perturbation. If it is limited during training by proper regularization, it restricts
the effect of $\ell_2$-perturbations.

— Curvature: In Sect. 3.1 it is argued that low curvature in the decision boundaries,
as well as in the loss function, are desired properties for robustness. Motivated 
by that discussion, the authors of [48] proposed penalizing solutions with high 
curvature of the loss function around the training data. 
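
The penalty terms listed above can be computed in a few lines. The following is a minimal NumPy sketch and not the implementation of any of the cited works: the toy one-layer model f(x) = tanh(Wx), the function names, and all sizes are assumptions chosen for illustration. It evaluates the $\ell_1$ penalty, the nuclear norm, and the Jacobian-based bound for a small perturbation.

```python
import numpy as np

def l1_penalty(W):
    """Sum of absolute entries; promotes approximately sparse weights."""
    return np.abs(W).sum()

def nuclear_norm(W):
    """l1-norm of the vector of singular values; promotes low rank."""
    return np.linalg.svd(W, compute_uv=False).sum()

def f(W, x):
    """Toy one-layer network (an assumption made for illustration)."""
    return np.tanh(W @ x)

def jacobian(W, x):
    """Jacobian of f with respect to x: diag(1 - tanh(Wx)^2) @ W."""
    return (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
x = rng.normal(size=8)

eps = 1e-3
eta = rng.normal(size=8)
eta *= eps / np.linalg.norm(eta)            # rescale so that ||eta||_2 = eps

J = jacobian(W, x)
output_change = np.linalg.norm(f(W, x) - f(W, x + eta))  # true change
linearized = np.linalg.norm(J @ eta)                     # first-order estimate
bound = eps * np.linalg.norm(J, "fro")                   # eps * ||J||_F

print(l1_penalty(W), nuclear_norm(W))
print(output_change, linearized, bound)
```

In a training loop, such terms would be added to the loss with a weighting factor; how that factor is chosen is not covered by this sketch.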


5 Future Directions 


Adversarial examples are a potential obstacle to the widespread deployment of
DNNs, particularly in safety-critical applications. A definitive solution still seems
out of reach, even for a task as simple as MNIST image classification. Adversarial
training is the best known defense so far, but it comes with two problems: it is
computationally costly and it degrades the generalization of DNNs.
Future work may clarify whether the latter problem can be solved by different training
techniques, or whether degraded generalization is an inevitable price for
more robustness. There is no consensus on the nature of adversarial examples or on
which features of classifiers play the central role in adversarial robustness. There are
many indications, but they align with experimental results only occasionally. Recent
statistical learning theory approaches, however, provide a promising path to address
generalization and robustness simultaneously. An adequate account of
adversarial robustness, although not attained so far, remains the ultimate goal and
makes this an exciting field of research for the coming years.
Abstract Renewable energy resources have become a fundamental part of the electrical
power supply in many countries. In Germany, renewable energy resources
contribute up to 29% to the energy mix. However, the challenges that arise with 
the integration of those variable energy resources are various. Some of these tasks 
are short-term and long-term power generation forecasts, load forecasts, integration 
of multiple numerical weather prediction (NWP) models, simultaneous power fore- 
casts for many renewable farms and areas, scenario generation for renewable power 
generation, and the list goes on. All these tasks vary in difficulty depending on the 
representation of input features. As an example, consider formulas that express laws 
of physics and allow cause and effect of otherwise complex problems to be cal- 
culated. Similar to the expressiveness of such formulas, deep learning provides a 
framework to represent data in such a way that it is suited for the task at hand. Once 
the neural network has learned such a representation of the data in a supervised 
or semi-supervised manner, it makes it possible to utilize this representation in the 
various available tasks for renewable energy. In our chapter, we present different 
techniques to obtain appropriate representations for renewable power forecasting 
tasks, showing the similarities and differences of deep learning-based techniques to 
traditional algorithms such as (kernel) PCA. We support the theoretical foundations 
with evaluations of these techniques found on publicly available datasets for renew- 
able energy, such as the GEFCOM 2014 data, Europe Wind Farm data, and German 
Solar Farm data. Finally, we give a recommendation that assists the reader in building 
and selecting representation learning algorithms for domains other than renewable 
energy. 
1 Introduction


Forecasting future power production or consumption has been attracting
more and more interest during the last few years, as the energy system in many
countries is transforming from a traditional centralized system to a more decentralized
system. This change is taking place because we are introducing a lot of
variable energy producers and consumers in the electrical grid. To maintain a stable 
electrical grid, we need reliable information about future power production and con- 
sumption. This information allows the planning of energy storage and consumption 
accordingly, so that generation and consumption are balanced. 

In day-ahead¹ power forecasts, especially in renewable power forecasting, we
often rely on NWP models and use information, e.g., about future wind speed or 
temperature, to forecast future production and consumption of power. Currently, there 
are predominantly two separate approaches: The first one is to create a physical model 
based on the characteristics of, e.g., a wind farm or a household. The second approach 
is to use machine learning (ML) models, such as artificial neural networks (ANNs) 
or support vector machines, to perform the task of forecasting future consumption 
or production based on NWP data. 

There is an ongoing debate on whether and when it is better to implement either 
physical models or ML algorithms. The advantage of the physical models is that 
they are strongly modeled for a particular type of household or specific renewable 
power plant, such as a wind turbine, and are applicable without utilizing historical 
data. The advantage of ML models is that we do not need to know the particular 
physical characteristics of the renewable power plant, wind turbine, or the building 
we want to model. The ML model learns a function that maps the input data to our 
desired output, i.e., NWP data to the power time series. Of course, this data-driven
approach makes models adapt to (noisy) input data, but reduces or even eliminates
the interpretability of such models. However, ML models, especially deep learning
(DL) models, are capable of producing smaller errors when forecasting power time 
series than their physical counterparts [6]. 

Even though ML algorithms profit from learning a mapping between the input 
and the target data, they still greatly benefit from feature selection and manually 
engineering essential features from the set of all inputs beforehand [4]. Typically, 
humans or so-called filter and wrapper algorithms need to select the best input fea- 
tures for the task. Finding the most important features can be quite slow and often 
involves a lot of background knowledge about the solution of the task one wants 
to solve. Similar challenges arise when feature engineering approaches are used to 
create additional features for the ML tasks by using domain knowledge [4, 30]. The 


¹Power or load forecasts of the upcoming day.
whole task of selecting and engineering features is tedious and can take up several
selection, evaluation, and engineering cycles. Therefore, the overall process is quite 
labor intensive, and requires a great deal of computational capacity. 

Representation learning (RL) tries to overcome these disadvantages as a special- 
ized field of ML that exploits an automatic data-driven feature engineering and feature 
selection approach. During RL we learn latent (or hidden) features. Latent features 
describe the data with sufficient accuracy and often provide the distribution under- 
lying the original input features, which helps the ML task to perform better. Finding 
and constructing the latent features is also referred to as feature extraction [28]. RL 
can be seen as an advancement of manual feature engineering and selection, as we 
do not need to employ domain knowledge to select essential features, or to come up 
with mathematical models describing relations between features. Instead, during RL 
we may employ a deep ANN that learns latent features about our input data. 

A latent feature model obtained with the help of RL often makes it possible to
further improve tasks that use the latent feature representation as input. These
tasks (using the latent features) do not necessarily need to be forecasting a power time 
series but can also be the classification of images, or prediction of a car’s trajectory 
using different sensory inputs. To present concepts from the field of RL which are 
applicable in the field of power time series forecasting and other domains, this chapter 
aims to: 


— Discuss major challenges in power time series forecasts, 

— present RL for power time series, 

— show the influence of RL on power time series forecasting by introducing evalua- 
tion methods for RL, and 

— provide various examples. 


As this chapter introduces the general concepts of RL for time series, we do
not include a comparative study with the traditional techniques of feature selection
and feature engineering.

The remainder of this chapter is structured as follows: Sect. 2 defines a forecasting
task based on three examples of power time series. Section 3 explains feature extraction
in more detail, introducing traditional algorithms as well as deep architectures
for RL. Section 4 proposes several evaluation strategies for RL in renewable power
forecasts based on the previously mentioned definitions. Section 5 applies these
algorithms and evaluation measures in several examples of RL in power time series
forecasting. Section 6 concludes this chapter and gives some advice on how to apply
RL to other ML problems. It also provides insights on how to design an RL network
and select appropriate parameters.


2 Regression in Power Time Series Forecasting 


This section defines time series forecasts in the context of renewable energies. Chal- 
lenges for those time series are introduced based on those definitions. 
2.1 Renewable Power Time Series Forecasting


We use the term power time series for three different types of data here: wind, solar, 
and load time series. The whole process to forecast the targets of the data is typically 
a two-step approach. 

The first step involves forecasting the weather features [5] with a typical forecast
horizon of up to k = 72 h into the future. Some important input features provided by the
NWP for power time series forecasting are wind speed, air pressure, wind direction,
temperature, humidity, solar irradiance, rainfall, and snow coverage, among many
others. Creating these features based on NWP models is computationally
expensive; in this chapter they are assumed to be given. The second step maps the
set of weather features to the generated power of a wind power plant, the load of a 
household, or the output generated by a solar power plant. Forecasting a power time 
series in the second step, e.g., using neural networks, a Support Vector Machine, 
or a linear regression, involves finding a regression model mapping the set of input 
features to the power. 

While the second step can be considered as a “classic” regression problem without 
modelling a specific time dependency, the overall process including the first step is 
referred to as a time series problem [5]. However, in most cases the second step is 
also considered as a time series problem, as it includes so-called time-shifted features 
(explained later in this section) that improve the forecast quality. 

Correspondingly, both kinds of data, the NWP and the generated power, are time 
series consisting of an ordered list of tuples: 


$$T = \{(t_0, \mathbf{x}_0), \dots, (t_k, \mathbf{x}_k), \dots, (t_n, \mathbf{x}_n)\}, \quad \text{with} \quad \mathbf{x}_i = (x_0, \dots, x_D). \tag{1}$$


Each of the tuples consists of a timestamp $t \in \mathbb{R}$ and a feature vector $\mathbf{x} \in \mathbb{R}^{D}$,
which gathers all D data points for that time step. In the case of the power or load
time series, $\mathbf{x}$ is a scalar, i.e., D = 1. In the case of an NWP time series, it is a pertinent
mixture of the items mentioned in the list of weather features. Data samples can be
equidistant in time or not. For simplicity and a more straightforward explanation, we 
assume the former here. 

In power and load time series forecasting, so-called time-shifted
weather features are often introduced. Such time-shifted features are weather features of a
previous or future time step. Taking future values into account is possible as features
from the NWP are themselves forecasted. Consider predicting the power at time step $t_k$:
by additionally including the weather features $(t_{k-1}, \mathbf{x}_{k-1})$ and $(t_{k+1}, \mathbf{x}_{k+1})$, the forecast
error is reduced, because the time dependency on previous and future time
steps is introduced into the model.
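
As a minimal sketch of this idea, the following plain-Python function appends the feature vectors of neighbouring time steps to each sample of a time series given in the tuple form of Eq. (1). The function name, the `shifts` parameter, and the sample values are hypothetical and only serve as an illustration.

```python
def add_time_shifted_features(series, shifts=(-1, 1)):
    """Append the feature vectors of neighbouring time steps to each sample.

    series: list of (t, x) tuples, where x is a list of weather features.
    shifts: relative offsets; -1 is the previous step, +1 the next one.
    Samples whose neighbours fall outside the series are dropped.
    """
    lo = max(0, -min(shifts))
    hi = len(series) - max(0, max(shifts))
    extended = []
    for k in range(lo, hi):
        t_k, x_k = series[k]
        features = list(x_k)
        for s in shifts:
            features.extend(series[k + s][1])
        extended.append((t_k, features))
    return extended

# Hypothetical hourly NWP samples: (timestamp in h, [wind speed, temperature]).
nwp = [(0, [4.0, 10.0]), (1, [5.0, 11.0]), (2, [7.0, 11.5]), (3, [6.0, 12.0])]
shifted = add_time_shifted_features(nwp)
for sample in shifted:
    print(sample)
```

The first and last samples are dropped here because they lack a previous or a next time step; padding them instead would be an equally valid design choice.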

As outlined above, during regression at step two, we try to find a function that 
maps the data of one time series to the data of a second one. In our case, we try to 
find a function that maps the NWP time series to a power time series. As seen in 
Eq. (2), in the case of a linear regression we want to find a set of weights $\mathbf{w}$ that,
multiplied with the input NWP data $\mathbf{x}$, results in the current power, allowing for a
particular variance of $\epsilon$ [24, p. 19]:


$$\text{Power}(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + \epsilon. \tag{2}$$
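
A linear model of the form of Eq. (2) can be fitted by ordinary least squares. The sketch below uses synthetic data as a stand-in for NWP features; the weights `w_true`, the noise level, and the sample sizes are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for NWP inputs: 200 samples, 3 weather features.
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -0.2, 0.1])                # assumed ground-truth weights
power = X @ w_true + 0.01 * rng.normal(size=200)   # Eq. (2) with small noise

# Least-squares estimate of the weights w.
w_hat, *_ = np.linalg.lstsq(X, power, rcond=None)
residual_std = np.std(power - X @ w_hat)

print(w_hat, residual_std)
```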


When doing regression with (deep) neural networks, we try to learn the parameters
of a neural network so that it maps our inputs to the appropriate output. Neural
networks extend the linear regression task of Eq. (2) to several such regression
models that are hierarchically combined (including non-linear activation functions),
with different inputs from different layers. With the help of the neural network, we
are also capable of learning non-linear relations between the NWP data and the
power time series, as we can allow for non-linear activation functions, such as the
logistic function.


2.2 Challenges of Power Time Series Forecasting 


The significant challenges in power time series forecasting arise due to the aforementioned
transformation of the energy system. The former centralized power grid is
changing to a more decentralized grid [22]. Those decentralized grids often have 
renewable power plants close to where energy is consumed. This introduces signif- 
icant challenges for the power grids as a whole but also generates new challenges 
regarding the forecasting. 

As we have more and more renewable power plants connected to the grid, we 
cannot rely on forecasting an aggregated power output, e.g., for all wind power plants 
in a region, but we must forecast for each power plant individually. Depending on 
their connection to the power grid, each plant influences a specific part of the local 
grid [18]. These challenges get even more complicated with home mounted solar 
panels, or electric vehicles charged at home. 

Another challenge arises due to smart grids and their smart measuring infrastruc- 
ture [13]. This kind of measuring infrastructure increases the amount of available 
data that needs to be processed by our models, whether they are machine learning 
or traditional models. In addition to data complexity, additional (smart) measuring 
infrastructure allows us to create more detailed models of our power grid. 

With an increased amount of available data and an increased amount of power 
plants that need to be forecasted, we have to think about algorithms that allow for easy 
integration and processing of this vast amount of data. Additionally, the algorithms 
need to be able to adapt the embedded knowledge to different power plants. One
solution to these challenges can be representation learning and, building on top of it,
multi-task learning.
3 Foundations of Representation Learning


Creating a representation or selecting relevant features such as sensor data from 
complex data can improve the performance of an ML algorithm [2]. The extraction 
and selection of features help the ML algorithm achieve better results for a particular 
task when compared to using the raw data directly [3, p. 268]. The research area con- 
cerned with determining suitable features is called representation learning [2, 14]. 
Typically, the representation is either in a higher dimensional or a lower dimensional 
space compared to the original input. In this chapter, we are especially interested 
in the latter case for deep learning-based methods, as such a representation reduces 
the computational effort once the representation is learned. Also, for some deep 
architectures, a lower dimensional representation is more suitable for the training, 
as discussed in detail later. However, some presented traditional algorithms, such 
as kernel principal component analysis (PCA), first transform the data in a higher 
dimensional space and then reduce it to a lower dimensional space in a second step. 

In the context of ML, dimensionality reduction (feature reduction) is the process of
extracting or selecting the k important features out of n input features, where
typically k is smaller than n. Further, this reduction process aims to find a set of k
features that are non-redundant and informative. Typically, dimensionality reduc- 
tion techniques are either from the field of feature extraction or the field of feature 
selection, yielding the following advantages: 


— Reducing the computational effort for training the ML model, 

— improving the forecast quality, and 

— limiting the number of features to informative ones that ideally allow for an inter- 
pretation of the model’s behavior. 
Fig. 1 A basic overview of the data mining process with examples of different methods. The
methods are annotated with the terms feature engineering, selection, extraction, or learning. Note
that feature reduction can be part of I–IV

Figure 1 shows a general data mining process annotated with the terms feature
engineering, selection, extraction, and learning. Some items have one or two terms, as 
some of the algorithms do both feature selection and feature extraction. We describe 
our view on the different terms of feature engineering, selection and extraction, and 
how we apply them throughout this chapter. 

In feature selection, we reduce the number of given features using approaches 
such as filters and wrappers. Both methods select a subset of k features from the set 
of all features n [20]. Filters are, e.g., information theory-based methods that allow 
for selecting relevant features based on entropy. Filters are independent of the ML 
algorithm, while wrapper algorithms are dependent on the ML algorithm and select 
features based on the best evaluation of the algorithm and the current set of features. 

Even though filters and wrappers allow the reduction of the number of relevant 
features, feature engineering is still a crucial concept to obtain good forecast qual- 
ity [4]. In feature engineering, we create features in a way that they help the ML 
algorithm to improve its performance. This engineering is done either by a human 
expert with domain knowledge or automatically by an algorithm. In the latter case, 
it is called representation learning or feature extraction. 

RL, also called feature learning, is a field of methods to derive features that are most 
relevant to an algorithm. Often this process is described as determining latent or hid- 
den features that explain the process that underlies the data. By learning these latent 
features, ideally, superior forecast performance over manual feature engineering and 
filter and wrapper-based methods is achieved. Further, by reducing the number of 
features through RL we reduce computational effort at the same time. 

In areas such as vision and natural language processing, deep learning-based
representation learning methods improve the prediction quality dramatically compared to
traditional approaches and manual feature engineering based on domain knowledge.
Between 2012 and 2015, the classification error rate on the ImageNet-2012 dataset
dropped from 16.4% to 3.57% by utilizing deep learning-based representation
learning methods [1]. Section 3.2 explains some of those state-of-the-art
deep architectures for various domains.

To compare those deep architectures to traditional dimensionality reduction, we 
give a brief overview of PCA, kernel PCA, and explain the concepts of the wrapper 
and filter approaches in Sect. 3.1 in detail. 


3.1 Traditional Dimensionality Reduction Techniques 


In the following, we highlight some traditional feature reduction techniques. They 
are traditional in the sense that they have been well known for a long time, but they 
are also limited in their potential solutions for the extraction and selection of (latent) 
features compared to deep learning-based methods. 

For example, filter and wrapper methods (Sect. 3.1) can only limit the number
of relevant features but are not capable of extracting (relevant) features that explain
the underlying distribution. Accordingly, filter and wrapper methods are feature
selection methods, but not feature extraction techniques.

In contrast, PCA (Sect. 3.1) allows latent features to be extracted. These features
are derived based on the assumption that important features are the directions of 
the largest variance in the original feature space. For additional details on feature 
reduction methods, we refer to [2, 20, 32]. 


Filter and Wrapper: Filters are approaches that select a set of relevant features
based on a given measure, where the size of the set can vary. Different evaluation
measures are in use; typically, we differentiate between similarity,
information-theoretical, and statistical measures [20]. After selecting a suitable measure,
the features are evaluated and ranked according to the selected measure. Afterward, 
the algorithm or the human selects the most relevant features. This process provides 
an interpretable selection of features with small computational effort in compari- 
son to other dimensionality reduction techniques. Therefore, it scales well with the 
number of features. The disadvantages are: 


— Filters often select redundant features, 

— filters ignore the relation to the ML algorithm, and 

— filters only reduce the number of visible features and are not determining latent 
features. 
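
A filter of the statistical kind can be sketched as follows. The example ranks features by the absolute Pearson correlation with the target, which is only one possible measure and an assumption of this sketch; the synthetic data and the function name `filter_rank` are hypothetical. Feature 3 is deliberately constructed as a near-duplicate of feature 0 to illustrate the first disadvantage listed above.

```python
import numpy as np

def filter_rank(X, y):
    """Rank features by |Pearson correlation| with the target, best first."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=500)   # near-duplicate of feature 0
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

ranking, scores = filter_rank(X, y)
top2 = set(ranking[:2].tolist())

print(ranking, np.round(scores, 3))
```

Because each feature is scored on its own, the two top-ranked features are the redundant pair 0 and 3, while the weaker but complementary feature 1 is ranked below them.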


While filters operate independently from the ML algorithm, wrapper algorithms
depend on it [10]. In particular, they select the features by iteratively training the
ML algorithm on subsets of the n features, then evaluate the performance of the
ML model and select the feature set that performs best. In sequential forward feature
selection, one starts with an empty set of features. The feature that improves the ML
algorithm the most is added to the set of relevant features iteratively, until a predefined
number of k features is selected [9]. Other methods use an iterative approach
starting with the set of all features and then successively removing the least important
feature. Because the ML model is retrained for every candidate set, the effort rises
quickly with the number of features; moreover, extracting latent features is not possible.
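
Sequential forward selection can be sketched with a linear least-squares model as the wrapped ML algorithm. The choice of model, the validation split, and the synthetic data are assumptions of this example, not a prescription from the cited works.

```python
import numpy as np

def validation_error(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares on the chosen columns and return the validation MSE."""
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return np.mean((X_va[:, cols] @ w - y_va) ** 2)

def forward_selection(X_tr, y_tr, X_va, y_va, k):
    """Greedily add the feature that lowers the validation error the most."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    while len(selected) < k:
        errors = [validation_error(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining]
        best = remaining[int(np.argmin(errors))]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
# Hypothetical target that depends on features 2 and 4 only.
y = 3.0 * X[:, 2] - 1.0 * X[:, 4] + 0.1 * rng.normal(size=400)

selected = forward_selection(X[:300], y[:300], X[300:], y[300:], k=2)
print(selected)
```

Each round retrains the model once per remaining feature, which is where the quickly rising effort of wrapper methods comes from.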


Principal Component Analysis: PCA is an algorithm that is designed to extract
orthogonal features that are linearly uncorrelated. To this end, PCA assumes that
important features have a high variance. The individual steps of the algorithm are:


1. Remove the mean from all features.

2. Calculate the covariance matrix of all original features.

3. Calculate the eigenvalues and eigenvectors of the covariance matrix.

4. Sort eigenvectors by their eigenvalues (highest eigenvalue corresponds to the
highest variance in the direction of the corresponding eigenvector).

5. Select the eigenvectors of the k highest eigenvalues for dimensionality reduction.

6. Transform the original input data into a new feature space using the selected eigenvectors.


The transformed features are also called principal components, and they are the 
hidden features extracted by the algorithm. 
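
The six steps map directly onto a few NumPy calls. The following is a minimal sketch under the assumption that the data fits in memory as a samples-by-features matrix; the function name `pca` and the synthetic data are hypothetical.

```python
import numpy as np

def pca(X, k):
    """Project X (samples x features) onto its k leading principal directions."""
    X_centered = X - X.mean(axis=0)                  # step 1
    cov = np.cov(X_centered, rowvar=False)           # step 2
    eigvals, eigvecs = np.linalg.eigh(cov)           # step 3 (symmetric matrix)
    order = np.argsort(eigvals)[::-1]                # step 4
    components = eigvecs[:, order[:k]]               # step 5
    return X_centered @ components, eigvals[order]   # step 6

rng = np.random.default_rng(0)
# Hypothetical data: two strongly correlated features plus one noise feature.
a = rng.normal(size=1000)
X = np.column_stack([a,
                     2.0 * a + 0.1 * rng.normal(size=1000),
                     0.1 * rng.normal(size=1000)])

Z, variances = pca(X, k=2)
print(Z.shape, np.round(variances, 4))
```

The covariance matrix of the transformed features `Z` is diagonal, which reflects that the principal components are linearly uncorrelated.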
Fig. 2 Input of kernel PCA
with two colour coded class 
labels 


Fig. 3 Output of kernel 
PCA, with a linear kernel 


Fig. 4 Output of kernel 
PCA, with an RBF kernel 


Fig. 5 Output of kernel 
PCA, with a Cosine kernel 
However, it is often beneficial to apply a non-linear PCA, e.g., a kernel PCA.
Non-linearity is beneficial if the input features do not follow a linear pattern, as in Fig. 2. In
this case, the linear PCA is not capable of finding essential components, as shown in
Fig. 3. Kernel PCA therefore implicitly calculates the covariance matrix of a higher
dimensional representation of the input. The well-known Vapnik–Chervonenkis
theory [35] states that data transformed into a higher dimensional space
often provides linearly separable features.

Therefore, we first transform the input data with the kernel function into a higher 
dimensional representation. In the second step, the kernel calculates the dot product 
of the transformed data to obtain the covariance matrix [3, 29, 36]. This combination 
of non-linear transformation and dot product is referred to as the kernel-trick. Once 
the covariance of the transformed data is calculated, the PCA algorithm is applied. 
We can interpret the resulting eigenvectors of the kernel PCA as projections from the 
higher dimensions onto the principal components. After applying the kernel PCA 
we often obtain better features, as seen in Fig. 4. 

Each kernel has different characteristics; e.g., the feature space implied by the
radial basis function (RBF) kernel is infinite-dimensional
[3, p. 297]. Thus, it is important
to utilize the kernel that is most suitable for the data. Figure 2 shows the input to the
different examples of PCAs. The input has two features, $x_1$ and $x_2$. The circular data
presented here has two color-coded labels to indicate a reasonable and non-reasonable
transformation concerning the class label.

Figure 3 shows a non-reasonable result, where the linear PCA is not capable of
extracting meaningful latent features $z_1$ and $z_2$. In Fig. 4 we see the results of an
RBF kernel applied to the input data. We observe that a single PCA component ($z_2$)
is sufficient to separate the different classes, while in Fig. 5 the separability of the
two color-coded classes is even decreased when compared to the original input. For
additional information on kernels and in particular kernel PCA refer to [3].
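
The difference between a linear and an RBF kernel can be reproduced on an idealized version of such circular data. The sketch below implements kernel PCA directly on a precomputed kernel matrix; the noise-free rings, their radii, and the value gamma = 5 are assumptions chosen for illustration and are not the settings behind Figs. 2 to 5.

```python
import numpy as np

def kernel_pca(K, k):
    """Kernel PCA from a precomputed kernel matrix K (n x n)."""
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    # Center the data implicitly in the feature space.
    K_centered = K - ones @ K - K @ ones + ones @ K @ ones
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:k]
    # Projections of the training points onto the k leading components.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def rbf_kernel(X, gamma):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def is_separated(z, labels):
    """True if a single threshold on z splits the two classes."""
    a, b = z[labels == 0], z[labels == 1]
    return bool(a.max() < b.min() or b.max() < a.min())

# Two concentric, noise-free rings with one class label per ring.
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([0.2 * ring, 1.0 * ring])
labels = np.array([0] * 50 + [1] * 50)

Z_rbf = kernel_pca(rbf_kernel(X, gamma=5.0), k=2)
Z_lin = kernel_pca(X @ X.T, k=2)          # linear kernel = ordinary PCA

print(is_separated(Z_rbf[:, 0], labels), is_separated(Z_lin[:, 0], labels))
```

In this idealized setting, one component of the RBF kernel PCA suffices to split the two rings with a threshold, whereas no component of the linear variant can, since every linear projection of concentric rings overlaps.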


3.2 Deep Architectures for Latent Feature Extraction


In this section, we explain deep architectures that allow for latent feature learning. 
While in the sections above, traditional algorithms from the field of feature selection 
and extraction are explained, this section focuses on modern architectures that permit 
the extraction of useful features. 

We focus on autoencoders that are capable of learning a representation z of the
original input x by constraining the learning process or the representation z. In particular,
we are interested in methods that allow determining latent features, while reducing
the number of features for further processing and keeping the relevant information
at the same time [2, 8]. As stated earlier, reducing the number of features to reduce 
computational effort is an essential concept in feature extraction. Correspondingly, 
this section focuses on undercomplete autoencoders to learn a compressed represen- 
tation of the data. 
Undercomplete autoencoders learn a representation z which is smaller than the
original input x. This architecture is contrary to overcomplete autoencoders, in which
the representation of the data is higher dimensional than the number of input features.
Also, undercomplete autoencoders are more common, because once we have learned the
compressed representation, the computational effort for further processing is reduced
in comparison to the original input and the overcomplete representation.

In the following sections, we distinguish between generative and discriminative 
models. While generative autoencoders learn the distribution that is most likely to 
explain the data, discriminative autoencoders learn an efficient data encoding. Dis- 
criminative autoencoders are more convenient to implement and sufficient for most 
tasks. Generative autoencoders have the advantage that we can use them for denois- 
ing, imputation of missing values, and sampling data. 

Another way to learn latent features is to use deep belief networks (DBNs) and their
underlying restricted Boltzmann machines (RBMs). They learn a distribution of the
underlying latent features from which new samples of the input data can be drawn.
In an RL setting, these are pre-trained using contrastive divergence, and afterward,
they need to be fine-tuned to the regression or classification task using, e.g., stochastic
gradient descent (SGD). During this fine-tuning phase, DBNs and RBMs behave like
a normal multi-layer perceptron (MLP) and therefore are not much different from,
e.g., a denoising autoencoder (DAE). Due to this similarity, we decided not to explain
RBMs and DBNs in more detail.

Therefore, in the following sections, we explain three types of discriminative 
autoencoders giving an introduction to the concept of autoencoders. In the final 
section, we extend this idea with a generative approach. 


Autoencoder: An autoencoder (AE) is a variant of an MLP that learns an encoding
z = f(x) and a decoding x = h(z), where x are the input features and z is the encoded
version of x. Due to the reconstruction of x from z, the input and output layers have
the same size [8].

Even though AE architectures are diverse, probably the most common architecture 
is the undercomplete AE as shown in Fig. 6. Undercomplete autoencoders reduce the 
dimension in each layer starting from the input layer. This side is called the encoding 
side, learning the mapping to encode x with a function f. At the center, also called 
the bottleneck of the AE, the layers are mirrored to produce the decoding side of the 
AE to reconstruct x with a function h. 

That is, the AE is trained to reconstruct the input on the output side. The idea is 
that the bottleneck serves as a feature extractor of the input data. Due to the reduced 
dimensionality at the bottleneck, the AE is forced to learn latent features or an 
efficient encoding that is sufficient to reconstruct the original features. In particular, 
the smaller dimension of z compared to x assures that the MLP is not learning the 
identity function. The following function typically describes the objective of the 
training [8]: 


L(x, h(f(x))), 

Fig. 6 An example undercomplete AE topology. The AE reduces the dimensionality in each layer
of the encoder. The latent feature representation at the bottleneck consists of the extracted
hidden features that are sufficient to reconstruct the original input successively in each layer
of the decoder


where L is a loss function, e.g., a squared loss, that penalizes the dissimilarity between
x and the reconstruction h(f(x)).

After training the AE, we cut the network behind the bottleneck, and attach a 
conventional ML algorithm. Using the learned encoding as an input to the regression 
or classification model is similar to using components of a kernel PCA. 
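The workflow of training an AE and then cutting it behind the bottleneck can be illustrated with a deliberately tiny sketch. It assumes a linear encoder and decoder with a single bottleneck neuron, plain gradient descent on the squared loss, and toy two-dimensional data. None of these choices stem from the experiments in this chapter, and all names are hypothetical.

```python
# Toy undercomplete AE: 2 input features -> 1 latent feature -> 2 outputs.
# Linear layers and plain gradient descent are simplifying assumptions.

def encode(w, x):
    """Encoder f: project x onto the bottleneck, z = w . x."""
    return w[0] * x[0] + w[1] * x[1]

def decode(v, z):
    """Decoder h: reconstruct x from the latent feature z."""
    return (v[0] * z, v[1] * z)

def reconstruction_loss(w, v, data):
    """Mean squared loss L(x, h(f(x))) over the dataset."""
    total = 0.0
    for x in data:
        r = decode(v, encode(w, x))
        total += (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2
    return total / len(data)

def train(data, steps=500, lr=0.05):
    w, v = [0.1, 0.1], [0.1, 0.1]  # small initial weights
    n = len(data)
    for _ in range(steps):
        gw, gv = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            r = decode(v, z)
            e = (r[0] - x[0], r[1] - x[1])    # reconstruction error
            back = v[0] * e[0] + v[1] * e[1]  # error propagated back to z
            for j in range(2):
                gv[j] += 2 * e[j] * z / n
                gw[j] += 2 * back * x[j] / n
        w = [w[j] - lr * gw[j] for j in range(2)]
        v = [v[j] - lr * gv[j] for j in range(2)]
    return w, v

# Toy data lying on a line, so a single latent feature suffices.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]
initial_loss = reconstruction_loss([0.1, 0.1], [0.1, 0.1], data)
w, v = train(data)
final_loss = reconstruction_loss(w, v, data)

# "Cutting behind the bottleneck": keep only the encoder output as features.
latent_features = [encode(w, x) for x in data]
```

In this sketch, `latent_features` would take the place of the original inputs when fitting a downstream regression or classification model. A deep AE replaces the two linear maps with stacked nonlinear layers, but the principle stays the same.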

It can even be shown that when the decoder is linear, and we use a squared error loss 
function, the latent features of the AE are in a similar sub-space to PCA. Moreover, 
by using singular value decomposition, it is possible to reconstruct the original PCA 
components [27]. The results in [27] detail the similarities of PCA and AEs. To
be able to compare forecast results obtained from AEs with more advanced
techniques such as nonlinear PCA, we extend the idea of the AE to more complex
structures, utilizing the potential of deep architectures for representation learning
even further.


Denoising Autoencoder: An undercomplete AE, as described above, learns latent 
features by going through a bottleneck. We achieve a similar form of restriction 
by adding a noise term to the input features. DAEs are what is known as regular- 
ized autoencoders, allowing similar results, even with overcomplete architectures. 
In practice, however, an undercomplete DAE is used to minimize the following loss 
function 


L(x, h(f(\tilde{x}))),

where \tilde{x} is a corrupted version of x. To create \tilde{x}, we typically draw random
data from a unit Gaussian distribution so that we obtain a matrix with the same shape as the
original input data. In the next step, we scale the random data by a constant c and add
the scaled random data to the original input. A typical value of c is, e.g., 0.1. By
reconstructing the original input features from the corrupted features, the DAE learns
the structure of the input distribution. The combination of regularization and
undercomplete architecture yields the following properties:


— The bottleneck minimizes the computational effort, once the DAE is trained.

— The regularization removes noise which is common, e.g., in sensor data.

— Bengio et al. [2] report that the DAE often improves the quality of the learned hidden
features, by being less noisy.
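The corruption step of the DAE can be sketched in a few lines. The function name `corrupt`, the toy input, and the use of Python's standard random number generator are assumptions for illustration; only the scaling constant c = 0.1 is taken from the text.

```python
import random

def corrupt(batch, c=0.1, seed=None):
    """Return a noisy copy of `batch`: x_tilde = x + c * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    return [[value + c * rng.gauss(0.0, 1.0) for value in row] for row in batch]

x = [[0.2, 0.5, 0.9], [0.1, 0.4, 0.8]]  # toy input: two samples, three features
x_tilde = corrupt(x, c=0.1, seed=0)      # corrupted version fed to the encoder

# During training, the loss still compares the reconstruction h(f(x_tilde))
# with the clean input x, not with x_tilde.
```

Drawing fresh noise in every training epoch, rather than once, is a common choice, but the text does not state which variant was used.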


Convolutional Autoencoder: In the following section, we accentuate the specifics
of convolutional neural networks (CNNs) in the context of autoencoders [23]. To this
end, we first review the basic principle of CNNs and explain how they can be used
to extract relevant features in the context of time series, and then continue with an
explanation of how to utilize them within an AE. Even though two- and three-dimensional
CNNs are more common in the context of natural language processing (NLP) or vision,
we limit our explanation to the one-dimensional CNN. This limitation makes the
explanation easier. Further, two- and three-dimensional CNNs reuse the filter weights
along their higher dimensions, therefore using the information of different features in
each convolutional step. This behavior is desired in image processing but might lead
to diluted features in time series, such as sensor data or NWP data. These conditions
limit the validity of learned features and are not desirable in time series problems.

In Fig. 7, we see an example of a CNN layer with a filter size of 1 x 3, applied with
the inner product to an input matrix of size 1 x 10. We can consider the input matrix
as a single input feature with ten time steps. Applying the filter to the input feature
matrix extracts features over these 10 time steps. By applying the filter of size 1 x 3
with a stride of 1, we obtain an output shape of 1 x 8. By including a padding
of 1, zeros are added to the beginning and end of the time series, increasing its size
to 1 x 12. When we apply the filter to the padded time series, we avoid a dimensionality
reduction compared to the original input data, and after the filter application, the
original input shape of 1 x 10 is maintained.
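The shape arithmetic described above can be checked with a small sketch of a one-dimensional convolution. The helper `conv1d` and the moving-average filter weights are hypothetical. As is usual for CNN layers, the filter is applied as a sliding inner product without flipping.

```python
def conv1d(series, kernel, stride=1, padding=0):
    """Slide `kernel` over `series` and take the inner product at each step."""
    padded = [0.0] * padding + list(series) + [0.0] * padding
    k = len(kernel)
    return [
        sum(w * padded[start + j] for j, w in enumerate(kernel))
        for start in range(0, len(padded) - k + 1, stride)
    ]

series = [float(t) for t in range(10)]  # one input feature with ten time steps
kernel = [1 / 3, 1 / 3, 1 / 3]           # hypothetical 1 x 3 filter (moving average)

without_padding = conv1d(series, kernel)          # 10 - 3 + 1 = 8 outputs
with_padding = conv1d(series, kernel, padding=1)  # 12 - 3 + 1 = 10 outputs
```

In general, the output length is (n + 2p - k) / s + 1, rounded down, for input length n, padding p, filter size k, and stride s.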

In case the CNN layer has multiple filters, we obtain an output of t_0, ..., t_9 for each
of those filters. By applying these filters, the CNN is capable of extracting temporal
features. These filters are often beneficial as they maintain relations between past and
future time steps [34, 37].

Fig. 7 Example of a one-dimensional CNN with a filter size of 1 x 3, applied to an input time
series of size 1 x 10 with additional padding. The additional padding makes it possible to keep
the dimension of the time series and to extract relevant information

Fig. 8 An example convolutional AE topology. The convolutional AE reduces the dimensionality
in each layer of the encoder and keeps the temporal information through padding. The latent
feature representation at the bottleneck consists of the extracted hidden features, including a
temporal representation of the input, used to reconstruct the original time series in the decoder

This procedure is now repeated in the encoder with the same kernel size, and in
each layer, the dimension of a filter's output is reduced, as seen in Fig. 8. Further,
by decreasing the number of filters in each successive layer, the latent feature
representation at the bottleneck is obtained. In this bottleneck, the learned feature
representation takes simple temporal features into account to represent the data. It
is worth noting that we are not using a pooling layer and therefore only decrease
the number of input features and convolve over time. A more detailed overview of
different data and their corresponding n-dimensional convolutions is given in [8]. On
the decoder side, so-called deconvolutional layers are used to reconstruct the original
time series. Deconvolutions use an inverse convolution to reconstruct the original
time series, e.g., with the same loss function as explained for the vanilla autoencoder
in Sect. 3.2. The properties of the convolutional autoencoder can be summarized as
follows:


— In time series applications, one-dimensional filters are sufficient because higher
dimensions impose conditions on the order of the features [8, p. 349].

— The combination of CNN and AE forces the algorithm to learn simple temporal 
features that are sufficient to reconstruct the original time series. 


Variational Autoencoder: The drawback of the discriminative architectures in the
previous sections is that they cannot be used to reconstruct missing values or generate
new samples. Variational autoencoders (VAEs) are a generative approach that
extends the idea of a simple autoencoder by adding a constraint on the encoding side
to obtain generative properties.

The encoding side is forced to learn the mean μ and standard deviation σ of a
Gaussian distribution, as shown in Fig. 9. μ and σ are used to create latent features
z by sampling from a unit Gaussian scaled with the learned μ and σ; this is also called
the reparameterization trick [16]. The scaled samples are used to reconstruct the original
features x with a function h. More formally, this can be done using a loss function:


L(x, h(q(z|x))) - D_{KL}(q(z|x) \| p(z)),


where q(z|x) is the scaled version of the unit Gaussian given the current input x and
the Kullback-Leibler divergence D_{KL}, see Sect. 4.2, penalizes the deviation of
the learned distribution q from a unit Gaussian p(z).
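The sampling step and the divergence term can be sketched as follows. The example assumes a diagonal Gaussian for the learned distribution and a unit Gaussian prior, for which the Kullback-Leibler divergence has a well-known closed form. The values of `mu` and `sigma` are made up and stand in for the encoder outputs.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps, with eps drawn from a unit Gaussian."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_unit_gaussian(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]  # hypothetical encoder outputs
z = reparameterize(mu, sigma, rng)    # latent features passed to the decoder
penalty = kl_to_unit_gaussian(mu, sigma)
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `sigma` during training, which is the point of the reparameterization. The penalty is zero exactly when the learned distribution equals the unit Gaussian.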

By applying the reparameterization trick, it is possible to extend the original idea 
of an AE and achieve the following properties: 


— Often the combination of a generative network with an encoder forces the VAE to 
learn a representation in a much lower dimensional space, see [8, p. 699] and [16]. 
— The decoder and the latent vector provide a generative framework. 


4 Evaluation of Representation Learning in Regression 
Tasks 


To evaluate representation learning in power series forecasting, we have to consider
three aspects. Firstly, we need to evaluate the overall performance of the feature
learner. Secondly, we need to evaluate how the actual regression model performs.
As methods for assessing the performance of the feature learner and the regression
model are similar, both are detailed in the same section. Thirdly, we need to measure
how well our latent features perform. Such a performance measure for latent features
needs to reveal how much information our learned representation maintains and how
well this representation performs in a forecasting task.

Fig. 9 An example VAE. The VAE reduces the dimensionality in each layer of the encoder and
learns vectors of μ and σ of a Gaussian distribution. The vectors are used to scale samples
from a Gaussian distribution. The scaled samples are used to reconstruct the original features
at the decoding side


4.1 Evaluation of a Regression Model 


To measure the performance of a solution for a regression problem, we can calculate
the difference between the model output \hat{T} and the actual time series T. Such
performance measures are also called time series measures and typically compare the
time series at each point in time. As mentioned in Sect. 2, we work on equidistant
time series. Therefore, the root mean squared error (RMSE) and the mean absolute error
(MAE) are good measures to compare T and \hat{T}. Other measures, such as dynamic time
warping or time warp edit distance, include the time information of the respective
time series [3, 31].


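\mathrm{RMSE}(T, \hat{T}) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2}    (3)


\mathrm{MAE}(T, \hat{T}) = \frac{1}{N} \sum_{i=1}^{N} |x_i - \hat{x}_i|    (4)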


The RMSE, as seen in Eq. (3), and the MAE, as seen in Eq. (4), use the data of
the model output \hat{T} and the actual time series T. They first calculate the difference
between corresponding data points. The RMSE then squares this difference, averages the
values, and takes the square root of the average. Therefore, the RMSE is non-negative
and gives the average distance between the data points and the model output. The
MAE, similar to the RMSE, is non-negative, as it takes the absolute difference of
the regressed time series and the original time series and averages it over all data
points. The main difference between these two measures is that the RMSE penalizes
substantial differences between the two time series more than the MAE [17].
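A short sketch makes the difference between the two measures concrete. The two toy forecasts below are invented so that they share the same MAE while one of them concentrates its error in a single large deviation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long series."""
    n = len(actual)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error between two equally long series."""
    n = len(actual)
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / n

actual = [0.0, 0.0, 0.0, 0.0]
small_errors = [0.5, 0.5, 0.5, 0.5]  # four moderate deviations
one_outlier = [0.0, 0.0, 0.0, 2.0]   # one large deviation, same total error

# Both forecasts have an MAE of 0.5, but the RMSE doubles from 0.5 to 1.0
# for the forecast with the single large deviation.
```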

During the training of the representation learner, we use the RMSE to evaluate
the reconstruction loss. Later on, we also employ the RMSE to compare the model 
quality of different learned representations. 


4.2 Evaluation of the Learned Feature Representation 


Assessing the influence of the features on the regression model output is another way 
to evaluate the input features of a regression model. Such an assessment uses one set 
of features, e.g., the learned feature representation, to determine their performance in 
comparison to another set of features, e.g., the original input features. Furthermore, 
we can gain information about the amount of compression, contained information,
and the features' ability to improve the regression. We explain three different
measures to compare the feature representations, learned or not. These measures are the
compression rate, the Kullback-Leibler Divergence (KLD), and correlation-based 
measures. 

First, we start by giving information about measuring the compression rate we 
achieve. This information allows us to group algorithms with similar compression 
rates and to evaluate within these groups. 


\text{Compression Rate} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}} = \frac{\text{Number of input features}}{\text{Number of latent features}}    (5)


The compression rate as seen in Eq. (5) is the ratio between input data and output 
data. In our case, we compare the number of features in the input layer to the number 
of features after the encoding. This comparison allows us to compare several metrics 
on different datasets grouped by the compression rate. 
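As a small worked example, the compression rate can be tabulated for a range of bottleneck sizes. The input dimension of 24 features is a hypothetical value chosen for illustration and is not taken from a dataset description.

```python
def compression_rate(n_input_features, n_latent_features):
    """Ratio of the number of input features to the number of latent features."""
    return n_input_features / n_latent_features

n_inputs = 24  # hypothetical number of input features
rates = {k: round(compression_rate(n_inputs, k), 2) for k in range(2, 10)}

# Fewer latent features mean a higher compression rate:
# 2 latent features give 12.0, 9 latent features give 2.67.
```

Results from different datasets can then be grouped by these rates even when the absolute numbers of input and latent features differ.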

An essential measure to assess the learned feature representation is the mutual 
information which is based on the KLD. The KLD allows us to measure the similarity 
of two distributions. The mutual information allows us to measure the influence of
each latent feature on the regression model output [3]. For simplicity, we limit the
explanation to the discrete case of distributions. 





D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right)    (6)


MI(X; Y) = D_{KL}(P(X, Y) \| P(X) P(Y))    (7)

Fig. 10 The power production target plotted

The KLD in Eq. (6) is used to compare two discrete distributions Q and P. It
measures the similarity of the two random variables X and Y with x ∈ X and y ∈ Y.
The KLD is not symmetric, i.e., D_{KL}(P || Q) ≠ D_{KL}(Q || P). If both P and Q are
feature representations, we obtain information about their relative entropy. Using
the mutual information (MI), see Eq. (7), with X being the feature representation
and Y the original power time series, we obtain information about how well Y can
be encoded using the current feature representation X [19], allowing us to compare
different learned feature representations regarding their ability to contribute towards
the regression model output.

KLD can be used to calculate the MI, or relative entropy of two distributions, e.g., 
when comparing the distributions in the feature space, or calculating the information 
loss of a linear model performed with the original input features to a linear model 
performed with the learned features. This approach is similar to the way t-SNE works 
[33]. 
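For the discrete case, both quantities can be computed directly from probability tables. The sketch below uses the natural logarithm and assumes that Q is non-zero wherever P is non-zero. The example distributions are invented.

```python
import math

def kld(p, q):
    """Discrete D_KL(P || Q); terms with P(x) = 0 contribute nothing."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    """MI(X; Y) = D_KL(P(X, Y) || P(X) P(Y)) for a joint probability table."""
    p_x = [sum(row) for row in joint]        # marginal of X (rows)
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y (columns)
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [px * py for px in p_x for py in p_y]
    return kld(flat_joint, flat_product)

p, q = [0.5, 0.5], [0.9, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y completely
```

Here `kld(p, q)` and `kld(q, p)` differ, which reflects the asymmetry of the KLD. The MI is zero for the independent table and log 2 for the fully dependent one. For continuous latent features, the distributions would first have to be estimated, e.g., by binning, which is not covered in this sketch.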

Furthermore, we can apply correlation measures, such as Pearson’s correlation 
coefficient, as shown in Eq. (8). The correlation coefficient quantifies how well 
our learned features—again, considered as a pair of random variables X and Y— 
linearly correlate with the power time series. Therefore, we identify these feature 
representations in a linear regression task. 


\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}    (8)
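Pearson's correlation coefficient, i.e., the covariance of two variables divided by the product of their standard deviations, can be computed as sketched below. The latent feature values and the two power series are toy data chosen to hit the extreme cases.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

latent = [0.1, 0.4, 0.5, 0.9]               # toy values of one latent feature
power_linear = [2 * z + 1 for z in latent]  # perfectly linear relation
power_inverse = [-z for z in latent]        # perfectly inverse relation
```

The coefficient is 1 for `power_linear` and -1 for `power_inverse`. Values near 0 indicate that a latent feature has no linear relation to the power time series, although a nonlinear relation may still exist.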


In addition to the correlation coefficient, we can measure the influence of different
latent features using an analysis of variance (ANOVA) [4]. ANOVA allows us to
identify the influence of certain features on the power time series. Consequently,
it yields a measure that identifies whether all features of the representation contribute
to the regression task or whether there are just a few contributing features. With a good
feature representation, all features should contribute equally to the regression model [9].

Furthermore, the correlation analysis can also be complemented with simple visu- 
alization techniques, such as a scatter plot, as shown in Fig. 10, that allow us to 
evaluate our learned features against our target variable in the regression task. 


5 Representation Learning Applied in Power Time Series
Forecasting


In this section, we detail the process of RL in power time series forecasting. We 
utilize the information we described in the previous section, to provide a data analysis 
process for power time series forecasting. The focus of this section is on RL and its 
evaluation for renewable energy tasks. However, we provide generic descriptions so 
that similar evaluation processes can be defined in other domains. 


5.1 The Power Time Series 


In total, we evaluate three different power time series datasets. All of these datasets are 
a combination of an NWP model and the measured power production or consumption. 
These datasets include: 


— Europe Wind Farm dataset, 
— German Solar Farm dataset, and 
— GEFCOM 2014 dataset. 


The Europe Wind Farm and German Solar Farm datasets can be downloaded from
our website² and the GEFCOM2014 dataset is also publicly available online.³ These
datasets make our data quite diverse, and we cover a broad spectrum of power time
series forecasting.


Europe Wind Farm Dataset: The Europe Wind Farm Dataset consists of the data 
from 45 wind power plants scattered across Europe. The dataset provides the NWP 
data as well as the corresponding power output normalized according to the installed 
capacities. In addition to the available features in the dataset, we augmented the 
available features using 1 h and 2 h time-shifted features for wind speed and wind
direction allowing time-dependent changes of the future and past weather to be 
taken into account, see Sect. 2.1. 
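Time-shifted features of this kind can be sketched as follows. The sketch assumes an hourly resolution of the NWP data, so that a shift by one step corresponds to 1 h, and it fills positions without a neighbour with `None`. Both choices, as well as all feature names, are assumptions for illustration.

```python
def shift(series, steps):
    """Shift a series by `steps` time steps; positive values look into the past.

    Positions without a neighbour are filled with None (an assumed convention).
    """
    n = len(series)
    return [series[t - steps] if 0 <= t - steps < n else None for t in range(n)]

wind_speed = [3.0, 4.0, 6.0, 5.0, 2.0]  # hypothetical hourly NWP values
features = {
    "wind_speed": wind_speed,
    "wind_speed_past_1h": shift(wind_speed, 1),
    "wind_speed_past_2h": shift(wind_speed, 2),
    "wind_speed_future_1h": shift(wind_speed, -1),
    "wind_speed_future_2h": shift(wind_speed, -2),
}
```

Rows containing `None` at the beginning and end of the series would typically be dropped before training.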


German Solar Farm Dataset: The German Solar Farm dataset consists of the data 
from 21 photovoltaic facilities in Germany. Their installed nominal power ranges
between 100 and 8500 kW. The PV facilities range from PV panels installed
on rooftops to full-fledged solar farms. All these facilities are distributed throughout
Germany [6]. Analogous to the Europe Wind Farm dataset, they provide the
corresponding NWP and the power time series, which are normalized to the
corresponding installed capacities. Again, we augmented the available features using 3 h
time-shifted features for sun position, solar height, clear sky, and radiation.


² https://www.ies.uni-Kassel.de, last accessed: April 2019.
³ http://dx.doi.org/10.1016/j.ijforecast.2016.02.001, last accessed: April 2019.


GEFCOM 2014 Dataset: GEFCOM, or Global Energy Forecasting Competition, is a
dataset based on the GEFCOM2014 forecasting challenge [11]. For the 2014
competition, four different tracks were available, i.e., electric load, electricity price, wind
power, and solar power forecasting. In the 2014 challenge, the main task was
probabilistic forecasting. In this work, we only use the data of the wind power forecasting
track, which is relevant for deterministic forecasts. In the GEFCOM dataset, we did
not take time-shifted features into account.


5.2 Representation Learning Experiments


In the following section, we explain the RL experiments step by step. Figure 11 
summarizes the training process of an individual experiment: 


— We split the dataset into training, validation, and test datasets. 

— We train an RL technique with k hidden features on the training dataset and cal- 
culate the compression rate as given in Eq. (5). 

— In those cases where we test different values of hyperparameters, we select
the best model based on the evaluation results in the validation dataset. Using 
the RMSE as evaluation measure allows the reconstruction quality of the original 
features to be assessed, see Sect. 4.1. 

— After selection of the RL model, we train the power forecast models on the latent 
features and power of the validation dataset. 

— In the end, the ML models are evaluated on the test dataset. 


We repeat the whole procedure for each of the datasets mentioned in Sect.5.1, 
our selected RL techniques, and between 2 and 9 latent features corresponding to 
different compression rates, depending on the number of input features contained 
in the dataset. We focus the evaluation measures on two aspects, as these build 
the foundation for more advanced techniques such as KLD and correlation-based 
measures. In the first aspect, we show examples of the relation between reconstruction 
loss and the compression rate. In the second aspect, we evaluate the forecast error 
(RMSE) of the models concerning the compression rate. 


Fig. 11 Overview of RL steps. z refers to the latent features and y is the power generation.
The steps shown are: representation learning on the training dataset; selecting the best
model based on reconstruction loss and compression rate; training the machine learning
algorithm on z, y of the validation data; testing the machine learning algorithm on z, y of
the test data


Preprocessing of the Data: We split all of the datasets into training, validation, and
test sets. This splitting makes it possible to train and select the best model based on
the validation set. The last split, the test set, is then used to do a final evaluation of
the regression task on unseen data. For preprocessing, we standardize
the data to have zero mean and a standard deviation of 1. The standardization
parameters obtained from the training dataset are applied to the validation and test
datasets afterward. Further, we normalize the generated power between 0 and 1 by the
maximum power generation of each farm. We also avoid categorical input features,
because those only roughly describe the weather phenomena but have a considerable
influence on the training process of the AE, especially on the reconstruction. Often,
when we used categorical features, the AE learned to reconstruct those categorical
features and was not capable of reconstructing the nominal weather features from
the hidden representation. Therefore, we avoid categorical features in all
experiments. By adding the time-shifted features, we allow features with time dependency,
as those are relevant for time series forecasting. For details refer to Sect. 2.1. In the
Europe Wind Farm dataset we use four different time-shifted features.⁴ Similarly,
four shifted features⁵ are added for the German Solar Farm dataset.
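The standardization and power normalization described above can be sketched with the Python standard library. The toy values, the function names, and the use of the population standard deviation are assumptions. The essential point is that the mean and standard deviation are estimated on the training data only.

```python
import statistics

def fit_standardizer(train_column):
    """Estimate mean and (population) standard deviation on the training data."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)

def standardize(column, mean, std):
    """Apply previously fitted standardization parameters to any split."""
    return [(value - mean) / std for value in column]

def normalize_power(power, max_power):
    """Scale generated power to [0, 1] by the maximum generation of a farm."""
    return [p / max_power for p in power]

train = [2.0, 4.0, 6.0, 8.0]  # toy values of one NWP feature
validation = [5.0, 10.0]

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
validation_scaled = standardize(validation, mean, std)  # reuses training statistics

power_scaled = normalize_power([0.0, 50.0, 100.0], max_power=100.0)
```

Reusing the training statistics keeps information from the validation and test sets out of the preprocessing. As a consequence, the scaled validation data generally do not have exactly zero mean and unit standard deviation.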


Applied Machine Learning Models: In the experiments, we always use the same 
set of hyperparameters for the support vector regression (SVR) and the MLP, the 
ML models. 

For both models, we use the standard parameters given by the scikit-learn frame- 
work [26]. The SVR uses an RBF kernel and we train it without a hard limit on 
the iterations. The MLP uses one hidden layer with 100 neurons and ReLU activation
functions. We train it with the Adam optimizer [15] for a maximum of 200 iterations.
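In scikit-learn, the described settings correspond to the library defaults of `SVR` and `MLPRegressor`. The sketch below spells the relevant parameters out explicitly for readability. It is an illustration of the description and not the original experiment code.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# SVR with an RBF kernel; max_iter=-1 means no hard limit on the iterations.
svr = SVR(kernel="rbf", max_iter=-1)

# MLP with one hidden layer of 100 neurons, ReLU activations, and the Adam
# optimizer, trained for at most 200 iterations.
mlp = MLPRegressor(
    hidden_layer_sizes=(100,),
    activation="relu",
    solver="adam",
    max_iter=200,
)
```

Both models would then be fitted on the latent features z and the power y with the usual `fit(z, y)` and `predict(z)` calls.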


Applied Time Series Measures: For all of our experiments we use the reconstruction 
loss and the forecast error, see Sect. 4 for more details. The reconstruction loss allows 
us to measure the maintained information within our representation. The forecast 
error allows us to determine if the latent feature representation performs well in a 
forecasting task. These two measures are the most intuitive and well-known ones in 
forecasting tasks. 


Guidelines for the Training: This section examines the training process to provide 
a guideline for use in other domains. 

As previously mentioned, we select the best performing model based on the recon- 
struction loss of the validation dataset. After selecting the RL model we encode the 
input features x to latent features z. Afterward, z acts as the new input to the ML 
model. By using the validation dataset, we minimize the risk that the RL model is
overfitted to the training data and, as a result, does not generalize well on unseen
data. Furthermore, using the validation dataset ensures that the ML model learns from data
not seen during the training of the RL model. We evaluate the final model using the test
dataset and the measures introduced in Sect. 4.2.

In our experiments, one significant difference between the evaluated RL models
concerns the latent features. In the case of PCA, the number of latent features is the number
of components, where the input data are first transformed into a higher-dimensional space
by using kernel functions. In the case of the AE and DAE, see Sect. 3.2, the number
of latent features is equal to the number of neurons in the bottleneck. In the case of a VAE,
see Sect. 3.2, the number of latent features is equal to the number of learned μ of the
normal distribution, as in practice μ is sufficient as the expectation. For the convolutional
autoencoder (CAE), see Sect. 3.2, the feature representation includes information
about 24 time steps of the NWP. It is important to note that in none of the deep
learning-based architectures have we transformed the data into a higher-dimensional
feature space. Also note that including the time-shifted features in the input forces
the RL model to determine latent features that include the time dependency of input
features.

⁴ WindSpeed100m, WindSpeed10m, WindDirectionZonal100m, WindDirectionMeridional100m.
⁵ SolarRadiationDirect, SolarRadiationDiffuse, SunPositionSolarHeight, SunPositionSolarAzimuth.

In future applications, it might be necessary to select the RL model with a specific 
compression rate, see, e.g., Fig. 12, depending on the reconstruction loss. Selecting 
an appropriate model with a specific compression rate can reduce the computational 
effort of certain ML models, as their computational effort increases with the number 
of input features. 

In the following sections, we describe our experimental results and give details 
on the training procedure and the hyperparameters for RL in power time series fore- 
casting. We show how the different proposed RL approaches perform compared with 
traditional approaches. Therefore, we apply four different types of AEs, as well as 
linear and non-linear PCA on the dataset to learn and extract new features. We use 
the latent features as input in a regression forecast model. This model maps the latent 
features from the NWP data to the power time series. In particular, we are interested 
in showing the advantages and disadvantages of the evaluated RL methods. 

We try to explain everything in a manner that allows for the easy repetition of the 
experiment in other domains. Therefore, we use the state-of-the-art machine learning 
framework scikit-learn [26] in connection with pytorch [25]. 

In the following section, we first evaluate the traditional feature extraction tech- 
nique, PCA, in Sect. 5.3. Afterward, we evaluate the RL methods for feature extraction 
in Sect. 5.4. In Sect. 5.6, we discuss the results achieved by both methods. By separat- 
ing the evaluation for traditional and RL methods, we aim to derive recommendations 
on how to apply RL to power time series forecasting. 


5.3 Principal Component Analysis for Feature Extraction 


In this section, we highlight the results obtained by PCA. We use PCA as a reference 
because it extracts hidden features and can reduce their number at the same time, 
see Sect. 3.1. This extraction and selection process permits comparisons to the deep 
architecture based RL methods. In contrast, filter and wrapper-based methods do not 
allow for the extraction of new hidden features, see Sect. 3.1. Further, by evaluating 
different kernels, we show their characteristics concerning the compression rate and 
the regression task. Assessing these values allows a wide number of representations
to be compared with the same algorithm, similar to the different deep learning-based
representations.

Fig. 12 Reconstruction loss by linear PCA over different compression rates based on
the German Solar dataset

Fig. 13 Reconstruction loss by RBF PCA over different compression rates based on
the German Solar dataset

Fig. 14 Reconstruction loss by cosine PCA over different compression rates based on
the German Solar dataset

Figures 12, 13, and 14 show the reconstruction loss of three different kernel
PCAs applied to the German Solar Farm dataset. In all cases, we observe that when
increasing the compression rate, the reconstruction loss increases as well. This
observation is to be expected, as the decreasing number of hidden features limits the
available information when performing a reconstruction. Alternatively, in terms of
PCA, the number of components is not sufficient to reconstruct the full variance of
the data.

However, the figures presented here illustrate the different behaviors of the applied
kernels. In the case of the linear kernel in Fig. 12, we observe an almost constant
reconstruction loss which then increases quickly. In contrast, the reconstruction loss of the
RBF kernel increases rapidly beyond a compression rate of 8. The reconstruction
loss of the cosine kernel is roughly constant at a median reconstruction loss between
0.41 and 0.43 until a compression rate of 11.2. The loss then increases up to an RMSE
of 0.49. We also note that for the cosine kernel, we have at least two outliers for
every compression rate. Comparing all techniques, we observe that the RBF kernel
has the lowest reconstruction error, followed by the cosine and the linear kernels.

Figures 15 and 16 summarize the results of the MLP and the SVR. We achieve 
these results by training the ML model on the extracted features from each kernel 
PCA for all compression rates. Correspondingly, these figures show the relationship 
between the forecast error and the compression rate. 

The results show that the linear kernel has the most substantial RMSE deviations.
The median RMSE of the linear kernel for the MLP increases with the compression 
rate. The RBF is the best performing kernel for the MLP, as it has the lowest median 
RMSE when compared to the other kernels or at least a similar median RMSE. The 
RMSE for both non-linear kernels behaves similarly to the MLP model, with only 
slight changes in lower compression rates. The SVR shows a similar RMSE behavior 
but with more variations throughout the different compression rates. 

The results for the Europe Wind Farm dataset are shown in Figs. 17 and 18. The 
compression rates vary between 2.67 and 12.0. The linear kernel has the lowest 
median forecast error for both ML models on all compression rates up to a compres- 
sion rate of 8. From a compression rate of 8, the cosine kernel seems to be performing 
better for all ML models in comparison to the other kernels. It is worth noting that 
all kernels show some outliers in forecast error. 

Figures 19 and 20 show the results for the GEFCOM2014 Wind dataset. Due to
the number of input features, the compression rates vary between 1.44 and 6.5. It
can be seen that for most compression rates the cosine kernel performs well for both
ML models. The median error of the cosine kernel varies between 0.225 and 0.26
for the MLP model and between 0.21 and 0.26 for the SVR model.


5.4 Deep Architectures for Feature Extraction


Deep-architecture-based RL often provides additional characteristics, such as denoising
and inferring missing values, that are not achievable with other algorithms. However,
we need to carefully select relevant hyperparameters and preprocessing steps
to obtain good results. This section first justifies the choice of
hyperparameters and then evaluates the different autoencoder types. Note
that the results of the reconstruction loss for the autoencoders are similar to the ones
from PCA and are therefore not shown in the results.


Fig. 21 Forecast error of MLP

Common Parameters in the AE Experiments: In many cases, it is complicated
to select a set of hyperparameters that permit us to achieve excellent results in deep
learning. However, recently, a few techniques (e.g., Adam, Xavier initialization,
and batch normalization) have made it possible to achieve excellent results while
minimizing the number of hyperparameters that need to be tuned. The following list
describes a selection of the important parameters:

— We vary the number of latent features between 9 and 2. These numbers provide
a broad range of compression rates depending on the dataset showing the effects
concerning input features.

— For AE, DAE, and VAE, learning rates of 0.001, 0.0005, and 0.0001 are evaluated. 
For CAE we test the learning rates of 0.01, 0.001, and 0.0001. 

— Leaky ReLUs are used as activation functions.

— The Adam optimizer is used to train the network. 

— In each layer, the number of neurons is reduced by a factor of 0.8, making it possible
to create deep nets that successively reduce the number of features to the required
number of latent features. Note that another common possibility is to first increase
the number of neurons compared to the original input features. In a sense, this
would be similar to the transformation of the nonlinear kernel PCA, but we do not
consider it.

— Utilizing Xavier as initialization, as a state-of-the-art method to initialize weights, 
minimizes the risk of exploding gradients [7]. 

— Similar advantages are achieved by normalizing the input (e.g., avoiding exploding 
gradients). Therefore, we use batch normalization in each layer [12]. 

Preliminary examinations showed that using batch normalization for the AE, DAE,
and VAE achieves results at least as good as those obtained without it.
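
The layer-width rule and the resulting compression rates can be sketched as follows. The function names, the rounding down of fractional widths, the stopping rule, and the value of 24 input features are assumptions made for illustration; the compression rate is assumed to be the ratio of input features to latent features.

```python
def encoder_layer_sizes(n_inputs, n_latent, factor=0.8):
    """Layer widths that shrink by `factor` per layer until the bottleneck is reached."""
    if not 0 < n_latent < n_inputs:
        raise ValueError("n_latent must be positive and smaller than n_inputs")
    sizes = [n_inputs]
    while True:
        next_size = int(sizes[-1] * factor)   # rounding down is an assumption
        if next_size <= n_latent:
            break
        sizes.append(next_size)
    sizes.append(n_latent)                    # bottleneck layer
    return sizes

def compression_rate(n_inputs, n_latent):
    """Assumed definition: number of input features per latent feature."""
    return n_inputs / n_latent

n_inputs = 24  # hypothetical number of input features
for n_latent in (9, 2):
    print(n_latent, encoder_layer_sizes(n_inputs, n_latent),
          round(compression_rate(n_inputs, n_latent), 2))
```

With 24 hypothetical inputs, varying the latent features between 9 and 2 gives compression rates between 2.67 and 12.0 under this definition.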


Summary of the Evaluation Results for the AE, DAE, VAE and CAE Architec- 
tures: The results of the deep architectures for the German Solar Farm dataset are 
shown in Figs. 21 and 22. In all cases the AE and DAE have a predominantly similar 
median RMSE and forecast error deviation. Compared to the other RL models, the 
CAE has the highest median RMSE values for both ML models and the VAE has 
the smallest median RMSE for both ML models, except for a few compression rates. 
For the MLP, the VAE obtains a median RMSE comparable to the PCA experiment. 

Fig. 22 Forecast error of SVR model applied to latent features obtained from AE,
DAE, VAE, and CAE, evaluated for the different compression rates for the
German Solar dataset

Figures 23 and 24 show the results for the Europe Wind Farm dataset. The AE and
the DAE have the smallest median forecast error, where the DAE has a slightly smaller
standard deviation. For smaller compression rates the two models have similar results
to those of the PCA, but slightly improve for higher compression rates. The VAE
has a similar forecast error for all compression rates and forecast models. The CAE
performs a bit worse than the AE and DAE. All RL models produce some outliers
regarding the forecast error.

Fig. 23 Forecast error of MLP model applied to latent features obtained from AE,
DAE, VAE, and CAE, evaluated for the different compression rates for the
Europe Wind Farm dataset

The results for the GEFCOM2014 Wind dataset are shown in Figs. 25 and
26. The AE and DAE perform similarly on both forecast models. Both of these RL
models also have the best performance for smaller compression rates. For the
higher compression rates, the VAE performs the best. The CAE is the worst
performing model, with a high overall RMSE.

Fig. 24 Forecast error of SVR model applied to latent features obtained from AE,
DAE, VAE, and CAE, evaluated for the different

Fig. 25 Forecast error of MLP model applied to latent features obtained from AE,
DAE, VAE, and CAE, evaluated for the different compression rates for the
GEFCOM2014 Wind dataset


5.5 Fine-Tuning for Power Forecasting


In the previous section, we used the learned feature representation directly and trained
ML models on it. Fine-tuning provides a more sophisticated
approach to forecasting power time series with previously learned AEs. The
problem with the previous approach is that the autoencoder’s weights
are not optimal for the forecasting problem, because they were optimized to
reconstruct the input features from a smaller feature representation. Due
to this unsupervised learning process, the learned representation
of the autoencoder might not be ideal for forecasting power time series. Fine-tuning
tries to overcome this problem by partly updating the weights of the trained AE for
the supervised task of power time series forecasting.

Fig. 26 Forecast error of SVR model applied to latent features obtained from AE,
DAE, VAE, and CAE, evaluated for the different compression rates for the
GEFCOM2014 Wind dataset

Fig. 27 Forecast error of

For our scenario, this means that we partially re-train the previously learned
encoder. First, we add a linear layer, equal to the MLP from the previous experiment,
to the bottlenecks of the AEs. Correspondingly, we have a hidden layer with 100
neurons connected to the number of latent features of the encoder. Furthermore, we
add an output layer to map the 100 neurons to the power time series. Second, in the
process of fine-tuning, we restrict the adaptation of weights to the last four layers:
the output layer of the MLP, the hidden layer of the MLP, and the last two layers
of the already trained AE. This restriction minimizes the training effort and makes
the best use of the previously learned representation.
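
The restriction to the last four layers can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' implementation: the encoder widths are hypothetical, its random weights stand in for a pre-trained encoder, batch normalization is omitted for brevity, and the use of Adam with a learning rate of 0.001 and PyTorch's `weight_decay` for the weight decay of 0.2 is an assumption.

```python
import torch
from torch import nn

def build_encoder(sizes):
    """Stack of Linear + LeakyReLU layers, e.g. sizes = [24, 19, 15, 12, 9]."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.LeakyReLU()]
    return nn.Sequential(*layers)

encoder = build_encoder([24, 19, 15, 12, 9])   # stands in for a pre-trained encoder
# MLP head: hidden layer with 100 neurons plus an output layer for the power value.
head = nn.Sequential(nn.Linear(9, 100), nn.LeakyReLU(), nn.Linear(100, 1))
model = nn.Sequential(encoder, head)

# Freeze the whole encoder, then unfreeze only its last two Linear layers.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder_linears = [m for m in encoder if isinstance(m, nn.Linear)]
for layer in encoder_linears[-2:]:
    for p in layer.parameters():
        p.requires_grad_(True)

# Only the four trainable layers are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3, weight_decay=0.2)

# One illustrative update step on random data.
x, y = torch.randn(64, 24), torch.rand(64, 1)
frozen_before = encoder_linears[0].weight.clone()
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("trainable tensors:", len(trainable), "loss:", float(loss))
```

The frozen early layers keep the representation learned during the unsupervised stage, while the last two encoder layers and the MLP head adapt to the forecasting target.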

Figure 27 illustrates the results of the procedure described above. In contrast to 
the default parameters of scikit-learn, we use a weight decay of 0.2 as described 
in [21] for AE and DAE. Furthermore, we train for 2000 epochs with a batch size of 
2048. 

The fine-tuning improves the previous results shown in Fig. 21, even though we
train the same number of neurons compared to the previous experiment on the MLP
model. By using fine-tuning, the median RMSE improves to values between 0.09
and 0.1 for all compression rates for the encoder of the AE and DAE. The median
RMSE of the VAE is about 0.01 higher in comparison to the other fine-tuned models.

Fig. 28 Forecast error of fine-tuned MLP model based on AE, DAE, VAE, and CAE. Evaluated
for the different compression rates for the Europe Wind dataset

Fig. 29 Forecast error of fine-tuned MLP model based on AE, DAE, VAE, and CAE. Evaluated
for the different compression rates for the GEFCOM2014 Wind dataset

Compared with the PCA results in Fig. 15, we can see that for compression rates
higher than 6.2 the fine-tuned AE and DAE have smaller median RMSE values.
Further, in all cases, the fine-tuned models have at least a similarly small standard
deviation of the forecast error.

For the Europe Wind Farm dataset, the improvement is more substantial, as seen
in Fig. 28. The median RMSE decreases to 0.15 up to a compression rate of 6 and
then increases to 0.19 for the best models. These results are an improvement over the
best PCA, with the smallest median RMSE of 0.1755 for lower compression rates.

We obtain similar improvements with the GEFCOM2014 Wind dataset, as shown
in Fig. 29. For almost all compression rates the median RMSE is around 0.175. These
results are an improvement over the best PCA, with the smallest median RMSE of
0.225.


5.6 Key Insights 


In the previous sections, we explained experiments and well-known algorithms aiming
to use representation learning techniques for power time series forecasting, see
Fig. 11. For the feature extraction process, we started with the well-known PCA. One
significant difference compared to the deep architectures is that the kernel PCA first
transforms the input data into a higher dimensional space. The eigenvectors from the
kernel PCA can be considered as a projection from the higher dimensional space,
permitting reasonably good forecast results for all evaluated datasets.

This concept is contrary to the architectures and methods used in the first autoen- 
coder experiment, where we directly reduce the dimensionality by a factor of 0.8 
in each layer. The final dimension, at the bottleneck, is similar to the number of 
components used in the kernel PCA experiment. Therefore, we directly compare the 
forecast error of the autoencoder to the results obtained from the kernel PCA based 
on the different compression rates. 

In the first deep learning experiment, the AEs perform slightly worse or similar 
to the kernel PCA. These results might be due to the transformation into a higher 
dimensional space by the kernel PCA. The Vapnik–Chervonenkis theory [35] can
describe these effects.

However, by utilizing fine-tuning, we are capable of achieving results superior to 
the kernel PCA by optimizing the already learned representation to its task. 

Overall the GEFCOM2014 dataset has the worst results compared to the other 
datasets. This result might be related to the missing time-shifted features that intro- 
duce the relevant time dependency required for time series forecasts. 


6 Conclusion 


This chapter introduced RL in the context of power time series forecasting.
We did this by first introducing traditional pre-processing methods such as feature
selection and feature engineering. Instead of manually finding useful input features for
our ML task, we applied RL algorithms, especially ANNs with deep architectures, to
learn latent features. We additionally showed how to evaluate the representation with
and without successive ML algorithms to find good RL models. In most cases, we
differentiated between distribution-based measurements and measurements applied
to the output of ML models trained on the feature representation. In the end, we
showed various examples of RL in the field of renewable power forecasting.

Representation learning can be seen as the starting point for every ML task, as it
obviates the necessity of domain knowledge and permits machine learning models
to increase accuracy. Even though we focused on forecasting power time series in
this chapter, we provided generic concepts when possible and conclude this chapter
by providing insights on how to build a successful RL model.


6.1 How to Build a Representation Learning Model 


As we have shown in previous sections, RL replaces feature engineering and feature
selection in the data mining process. Therefore, it removes the need for
manual feature engineering and selection. In this section, we propose some simple
conventions to apply RL to regression or forecast applications as conveniently as
possible.

First of all, it is most likely that the choice of AE will not matter in regard to
performance, as all of the chosen RL techniques perform similarly to or even better
than our baseline. To select the best RL model for your problem, the characteristics of your
data need to be identified. Depending on those characteristics, a corresponding AE
can be chosen. Some proposals, as mentioned in Sect. 3, are:


— CAE is good at maintaining and extracting temporal features and should be com- 
bined with, e.g., recurrent networks to make the best use of those features. 

— VAE can infer missing data. 

— DAE can reduce the amount of noise in the latent features. 


With the help of this information, and knowing the characteristics of the data, it is
possible to identify the type of autoencoder that is needed. The next step is to choose
the ML technique to map the latent features to the target data. We have shown that
even simple algorithms, such as linear regression, SVR, or MLP, will improve the
essential metrics for the problem. In the case of MLP models, one should also think
about adding a fine-tuning step for a task-specific adaptation of the weights, as shown
in Sect. 5.5. This additional step permits the ANN to specialize even further towards
the goal of the final ML task and improves the initial results considerably.


Abstract Load forecasting in smart grids is still exploratory; despite the increase of
smart grid technologies and energy conservation research, many challenges remain
for accurate load forecasting using big data or large-scale datasets. This chapter 
addresses the problem of how to improve the forecasting results of loads in smart 
grids, using deep learning methods that have shown significant progress in various 
disciplines in recent years. The deep learning methods have the potential ability to 
extract problem-relevant features and capture complex large-scale data distributions. 
Existing research in load forecasting tends to focus on finding predicted loads using 
small historical datasets and the behavior of the load’s consumers in smart grids. 
Moreover, current research which applies the conventional deep learning methods 
for load forecasting has shown better performance than conventional load forecasting 
methods. However, there is little evidence that researchers have addressed the issue of 
hybridizing different deep learning methods for complex large-scale load forecasting 
in smart grids, with the intent of building a robust predictive model in smart grids 
and understanding the relationships that exist between different predictive models 
and deep learning methods. Consequently, the purpose of this chapter is to provide 
an overview of how the load forecasting performances using deep learning methods 
in smart grids can be improved. 


1 Introduction


Today’s power system infrastructure has been developed and improved using new
technologies in different aspects. The new concept of power system grids, “smart
grids”, is a modern power system infrastructure that aims to build robust, reliable,
efficient grids and minimize the cost of production. Enhancing the grids with renewable
energy resources, automated control, and communication technologies provides
possible means of efficiency, reliability, and safety for smart grids. The objective of
the smart grids is to advance the use of technology and communication dramatically
by investing in the bidirectional flow of power and data. The smart grid infrastructure
is full of advanced sensing, communication, and computing abilities that work
in an interoperable way across the different parts of the power system, such as
generation and distribution [1]. The infrastructure scheme is illustrated in Fig. 1.

The effectiveness of smart grids relies on three primary roles that can help maintain 
and manage the grids as follows: 


— Dynamic pricing. 
— Demand-side management. 
— Load forecasting. 


The implications of these roles highlight the need to consider the planning and
operation of the power system. Dynamic pricing provides real-time pricing and
control [1]. An application of demand-side management is demand response, which
can be categorized into these three aspects [2]:


— Peak clipping: reducing peak loads to avoid exceeding the capacity of substations.
— Valley filling: promoting energy storage devices during off-peak loads.
— Load shifting: shifting the energy consumption, e.g., shifting the energy demand
from peak load time to off-peak load time using energy storage devices.

Fig. 1 An overview of the smart grid scheme. The physical part of the smart grids includes generation,
transmission, distribution and electric loads. The power flow is stepped-up after the generation and
stepped-down after transmission. The distribution power is measured with smart meters installed
at the end-user’s side. The information flow is bidirectional from the generator side to the end-user
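
The three demand response aspects can be illustrated on a toy load profile. The profile values, the cap and floor levels, and the simple rule of moving energy to the currently lowest-load hour are hypothetical choices for illustration and are not taken from [2].

```python
def peak_clip(profile, cap):
    """Peak clipping: limit every hourly load to at most `cap`."""
    return [min(load, cap) for load in profile]

def valley_fill(profile, floor):
    """Valley filling: raise off-peak loads to at least `floor` (e.g. by charging storage)."""
    return [max(load, floor) for load in profile]

def load_shift(profile, cap, step=1):
    """Load shifting: move the energy above `cap` to the lowest-load hours.

    The total energy is conserved; energy is moved in chunks of `step`.
    """
    shifted = peak_clip(profile, cap)
    excess = sum(profile) - sum(shifted)
    while excess > 0:
        i = min(range(len(shifted)), key=lambda k: shifted[k])  # current valley
        room = cap - shifted[i]
        if room <= 0:
            raise ValueError("cap is too low to absorb the shifted energy")
        moved = min(step, room, excess)
        shifted[i] += moved
        excess -= moved
    return shifted

profile = [2, 1, 1, 3, 6, 8, 5, 2]   # hypothetical hourly loads in kW
clipped = peak_clip(profile, cap=5)
filled = valley_fill(profile, floor=2)
shifted = load_shift(profile, cap=5)
print(clipped, filled, shifted, sep="\n")
```

Peak clipping and valley filling change the total energy of the profile, whereas load shifting keeps it constant and only changes when the energy is consumed.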


Load forecasting is an essential task that predicts future energy consumption in
order to fulfill the primary roles of smart grids at any time. In the planning, operation, and
control of the power system, load forecasting is a crucial element for defining
the capabilities that the future distribution system needs to provide.
If load forecasts are inaccurate, the errors propagate to all subsequent steps in the planning
of future loads, and the entire planning and operation are at risk. Accurate load
forecasting not only helps in optimizing the future generating units but also saves
investment in future power facilities and helps to define the risk factors in planning,
operation, and control tasks. Moreover, electricity price forecasting provides useful
information for power suppliers and customers using a developed bidding system.
Both suppliers and customers need accurate price predictions in order to establish
their bidding strategies to maximize their profits and benefits. Therefore, to achieve
the smart grids’ goals, accurate and efficient load and price forecasting has become
a crucial technique.

Although extensive research has been done using different physical and statistical
models, accurate electric load forecasting remains a challenge in smart grids. The
various artificial intelligence techniques and machine learning algorithms used in the
load forecasting problem are still insufficient to predict the load with the desired
accuracy. Moreover, most of these models are based on small datasets, and their
prediction errors are relatively high. Enhancing the smart grids with deep learning
methods to forecast loads can provide more accurate predictions and efficient predictive
modeling, as illustrated in Fig. 2.

In this chapter, we will explore the importance of load forecasting in the energy
industry and power systems; in particular, how energy consumption and electrical
loads affect the critical decisions in smart grids. We will review the traditional
deep learning methods used for load forecasting in smart grids and investigate
the hybrid deep learning methods applied for load forecasting using a real big dataset.


Electric loads -> Load profiles -> Data preprocessing -> Forecasting model

Fig. 2 An overview of the load forecasting scheme. The aggregated electric loads are represented as
load profiles using the smart meters. The CPU-based computer is used for data preprocessing, e.g.,
data cleaning and data normalization. The GPU-based computer is used to process deep learning
methods. The electric loads and load profiles are illustrated in Sect. 2. The data preprocessing and
forecasting model are illustrated in Sect. 4

In the next section, we will demonstrate how load forecasting is becoming a significant
contributor to energy expansion, how its objectives differ with the time horizon, how it is
affected by various influential factors, and which challenges arise from stochastic
time series. In the third section, we will extensively review the existing research on load
forecasting using conventional methods, machine learning methods, and
deep learning methods, highlight existing issues, and narrow the gaps in
existing research using significant deep learning approaches. In the fourth section, we
will elaborate on different promising load forecasting models using hybrid deep learning
methods and compare their results with existing load forecasting approaches.
In the last section, we will conclude with a summary, a balanced assessment of the
contribution of load forecasting in smart grids, and a roadmap for future research
directions.

The content of this chapter will be useful to researchers interested in the field
of electricity market forecasting as well as graduate students who work on electrical
engineering problems, especially load forecasting and energy consumption
prediction.


2 Background 


The evolution of many smart systems such as smart grids around the world has
raised new challenges and opportunities for utility providers as well as households
and enterprises. Before this development, energy suppliers and integrated utilities
faced fewer financial and energy management risks; in addition, end
users had no option but to buy electricity under cost-based contracts from local
providers. With all assesses, providers managed and passed all tariffs and costs to 
their customers. 

Later, developed electricity markets faced a new challenge: competitive
markets that allow any energy supplier to buy electricity and natural gas.
Subsequently, utility costs changed from cost-based to market-based tariffs that provide
end customers with multiple options for the same utility at different rates. On
the suppliers’ side, this competitive market created a variety of risks, such as the
fluctuation of fuel and electricity prices and the uncertainties of renewable
energy resources. On the end users’ side, energy consumption is the main
risk because of the modernization of customers’ lifestyles. This factor poses a major
challenge because of the uncertainty of customer loads and peak demands.

With these risk factors and the huge uncertainties of fuel prices, renewable energy,
and demands, accurate load forecasting has become an essential technique for energy
market participants such as providers and end users. In addition to these risk factors,
the importance of accurate predictions is based on several other factors. For example,
addressing the electrical demand and determining the peak time are essential
for providers, while avoiding high electricity prices and reducing energy consumption
are crucial for end customers.


2.1 Electrical Load


The electrical load at the distribution level is oscillatory and subject to change because
human activities follow daily, weekly, and monthly cycles. For instance, the
load is generally higher in the daytime and early evening and lower in the late
evening and early morning. Every electrical appliance or light bulb
that is switched on or off by customers directly affects the electrical load seen on
the distribution feeder. In general, customers buy electricity from providers to power
their end-product. Therefore, the distribution system exists to deliver the demanded energy
to customers for electrical appliances and equipment, lighting, heating,
and cooling, as well as other demands in the commercial and industrial sectors. The
distribution system of the smart grids must satisfy customer needs in order to deliver
a high quality of service.

At the same time, electricity capacity is the maximum electric power generated
by a specific energy resource under ideal conditions. The capacity reflects the
need for adequate resources to satisfy the load peaks at all times.
The generation capacity is commonly measured in kilowatts (kW) or megawatts
(MW). For example, if a wind farm accounts for 6% (2 MW) of the local
generation capacity, this does not mean that it contributes 2 MW to the utility
under all conditions; this value holds only under ideal conditions and is not necessarily
the actual amount. In contrast, electricity generation is the amount of energy produced over a
specific period of time, and it is commonly measured in kilowatt hours (kWh) or
megawatt hours (MWh). For example, if the wind farm runs at its
maximum capacity for three consecutive hours, it will produce 6
MWh of energy. If the wind farm runs at only half of its maximum capacity for these
three hours, it will produce 3 MWh of energy. Generally, many energy resources do
not operate at their maximum capacity all the time. Therefore, the produced energy
may vary according to the conditions at the power plants.
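
The distinction between capacity (power) and generation (energy) in the wind farm example can be written out as a small calculation. The function name and the `capacity_factor` parameter, meaning the fraction of the maximum capacity at which the plant runs, are introduced here for illustration.

```python
def energy_mwh(capacity_mw, hours, capacity_factor=1.0):
    """Energy produced (MWh) = capacity (MW) x fraction of capacity used x time (h)."""
    if not 0.0 <= capacity_factor <= 1.0:
        raise ValueError("capacity_factor must lie between 0 and 1")
    return capacity_mw * capacity_factor * hours

full = energy_mwh(2, 3)          # 2 MW wind farm at full capacity for three hours
half = energy_mwh(2, 3, 0.5)     # the same farm at half of its capacity
print(full, half)                # 6.0 3.0
```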

Accordingly, the electrical load demand is a trade-off between the high quality of 
service and electricity generation capacity. Besides, the uncertainty of fuel, renewable 
energy, and actual load demands are substantial risk factors which are considered 
on the suppliers’ side. Therefore, energy suppliers need adequate planning models 
and efficient forecasting models to determine the actual loads and satisfy customer 
demands. 


2.2 Load Forecasting 


Load forecasting is a technique usually used by energy suppliers to predict future
energy consumption to meet the load demand and supply balance in the generation,
transmission, and distribution markets. It is also used by household owners, building
managers in the commercial sector, and energy supervisors in the industrial sector to
meet their energy requirements and build their bidding strategies. Therefore, the load
forecasting strategy is indispensable for all active energy market players. Generally,
the forecasting technique is used to predict load, electricity price, fossil fuels, wind
power, and solar power. In this chapter, we will focus on load forecasting in the
literature review and our case study model. We will elaborate on the deep learning
methods applied to load forecasting.

Fig. 3 Different types of load forecasting. The time horizon of each category is illustrated together with
the purpose of the application. The average time interval of the STLF is days, the MTLF is months,
and the LTLF is years

The categories of load forecasting differ in their time horizon.
Each category serves a different purpose of the prediction, as
illustrated in Fig. 3. We define the three main categories and their objectives as
follows [3, 4]:


— Long-term load forecasting (LTLF): The time interval of this type of forecasting
ranges from five years to decades in the future. The LTLF is applied
mainly to the generation and transmission systems, with the aim of planning the
size and cost efficiency of the future electricity capacity or grid.

— Medium-term load forecasting (MTLF): The forecasting time interval of this type
ranges from a month to five years. The purpose of the MTLF is essentially to
plan for near-future power plants and show the dynamics of the smart grid.

— Short-term load forecasting (STLF): This type handles time horizons of a single
hour up to a couple of weeks. The STLF is necessary for the scheduling of
power plants. Besides, the applications of this type of forecasting include real-time
control, energy transfer scheduling, economic dispatch, and demand response.
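
The three categories can be expressed as a simple rule on the forecast horizon. The exact thresholds are assumptions: the text leaves a gap between "a couple of weeks" and "a month", which is assigned to the STLF here, a month is approximated by 30 days and a year by 365 days, and a horizon of exactly five years is treated as LTLF.

```python
from datetime import timedelta

ONE_MONTH = timedelta(days=30)         # assumed length of a month
FIVE_YEARS = timedelta(days=5 * 365)   # assumed length of five years

def forecast_category(horizon):
    """Map a forecast horizon to STLF, MTLF, or LTLF (thresholds are assumptions)."""
    if horizon <= timedelta(0):
        raise ValueError("horizon must be positive")
    if horizon < ONE_MONTH:
        return "STLF"
    if horizon < FIVE_YEARS:
        return "MTLF"
    return "LTLF"

for horizon in (timedelta(hours=1), timedelta(days=14),
                timedelta(days=180), timedelta(days=10 * 365)):
    print(horizon, forecast_category(horizon))
```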


2.3 Influential Factors of Load Forecasting Models


The fundamental purpose of load forecasting is to predict future load patterns for cost
saving and better planning and operation. Prior knowledge of the influential factors
that affect the load patterns is key to accurate load predictions. The
different influential factors on the electrical load and energy demand have been identified,
researched, and utilized in different papers in the literature [5–7]. These factors are
difficult to separate clearly because they influence the STLF, MTLF, and LTLF
models differently. The important factors that should
be considered when modeling load forecasting are classified as follows:


— Time factor: The electrical load varies with customers’ activities. In
a daily load pattern, energy demand is noticeably higher at certain times.
In general, the load demand is higher in the
daytime than at night. For instance, industrial and commercial energy consumption
is higher during working hours, while residential energy consumption is
higher in the evening. Working hours and working days are crucial because
of the variation in load patterns: the early working hours consume less than
the middle working hours, and weekends consume less energy than
working days. Energy consumption on holidays is more difficult to forecast
because of the infrequent activities. The load curve at each time resolution, such
as daily, weekly, monthly, or yearly, is periodic but variable and inconsistent. The
load curve always has a peak time of the day, day of the week, week
of the month, and month of the year.

— Weather factor: Significant weather conditions are influential factors in load forecasting.
The weather conditions include temperature, humidity, wind speed, and
cloud cover. These conditions are considered mostly for STLF modeling.
High temperatures in the summer season affect the customers’ comfort,
and they consume more energy for cooling. Likewise, low temperatures in
the winter season affect the customers’ comfort, and they use more energy
for heating. Therefore, there is a strong positive correlation between temperature
and energy consumption in the summer season and a strong negative correlation
between temperature and energy consumption in the winter season. Humidity
affects how severe high and low temperatures feel.
Hence, customers increase their energy consumption during significant humidity
and temperature conditions. Therefore, humidity is a considerable component for
load forecasting.

— Customer factor: Different kinds of customers, such as residential, commercial,
and industrial customers, consume energy for different purposes. Energy consumption
activities differ from one kind of customer to another. However, the load curves
are somewhat similar for the majority of customers of one kind. The customer factor
mainly depends on the size of the property, the type of property, the number of
occupants, and the amount of electrical equipment. Even so, equipment usage and
energy consumption may vary from one consumer to another within one kind.

Table 1 Different influential factors of load forecasting models

Factor  | Description                                                              | Forecasting model
Time    | Load patterns in: minutely, hourly, daily, weekly, monthly,              | STLF and MTLF
        | seasonally and yearly                                                    |
Weather | Significant conditions in: temperature, humidity, wind speed             | STLF and MTLF
        | and cloud cover                                                          |
Economy | Increase in: fossil fuel price, electricity price, industrial and        | LTLF
        | commercial development and population growth                             |
Other   | Large social events, sport events and industrial experiments             | LTLF



— Economy factor: The economy factor plays an important role in load forecasting,
especially for LTLF models. Economic factors include fossil fuel prices, electricity
price, industrial and commercial development, and population growth. Fuel prices
can influence load curves by raising the electricity price, which in turn affects
customers' consumption. Likewise, low electricity prices increase energy
consumption and hence the load demand. Industrial and commercial development in a
particular area increases energy consumption, as does population growth in that
area.

— Other factors: Load demand can also be affected by non-periodic occasions and
events that consume large amounts of energy, such as large social events, sports
events, and industrial experiments. Such high-consumption events are difficult to
predict, which results in a high average prediction error in the forecasting model.


In short, these factors may not influence every load forecasting model in the same
way, but they are all essential to consider. The most critical factor is time, which
directly reflects end customers' activities. In addition, temperature and humidity
are relevant influential factors for load forecasting because people respond
directly to weather conditions with heating and cooling. Accordingly, Table 1
summarizes the different influential factors and their use in each load forecasting
model.
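
The time factor described above is usually handed to a forecasting model as a few numeric calendar features. The following is a minimal sketch of that idea, not a method taken from the reviewed papers; the function name `time_features` and the chosen feature set are assumptions made for illustration, and a holiday flag is left out because it would need an external calendar.

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Encode the time factor of one load observation as numeric features.

    The feature names are hypothetical; they mirror the daily, weekly,
    and yearly patterns discussed in the text.
    """
    return {
        "hour": ts.hour,                       # position in the daily load pattern
        "day_of_week": ts.weekday(),           # Monday = 0 ... Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),  # weekends tend to consume less
        "month": ts.month,                     # proxy for the seasonal pattern
    }

# A Saturday evening in July.
features = time_features(datetime(2019, 7, 6, 18, 30))
```

Weather variables such as temperature and humidity can be appended to the same feature dictionary when they are available at the same time resolution.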


3 Forecasting Modeling Issues in Smart Grids 


In this section, we highlight the existing issues of load forecasting, review the
existing research that uses conventional statistical methods, machine learning
methods, and deep learning methods, and narrow the gaps in that research using
significant deep learning approaches. First, we look at some current general issues
of load forecasting modeling. Then, we give a short description of the most commonly
encountered methods and highlight the advantages and disadvantages of each. Finally,
we focus on deep learning methods applied to load forecasting, demonstrate their key
conceptual and algorithmic facets, and narrow the gaps of existing methods. We give
an overview of their prediction results, fields of study, locations, scale, the
datasets used, the models used, and the year of publication.


3.1 General Issues with Load Forecasting Modeling 


In general, good load forecasting models and accurate predictions are vital for
appropriate planning, operation, and control. All load forecasting categories are
difficult to model over a planning period because of the many challenges and
influential factors mentioned above. Accurate prediction remains challenging due to
the following difficulties:


— The strong correlation with weather conditions. Weather is sometimes
unpredictable and can change to an unexpected state.

— The large variation in energy consumption between customers due to unpredictable
events and activities, together with the lack of smart meters to record energy
consumption efficiently.

— Single-customer load forecasting is more difficult than forecasting the grid load,
because a single customer lacks a large historical record and the customer's
activities have stochastic effects.

— Non-stationary time series effects of the electric load. A non-stationary series
does not have a constant mean and variance.

— The high volatility of electrical load due to changes in seasonality and time
factor effects. The same season can differ from one year to another.
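
The non-stationarity mentioned in the list above can be made visible by comparing the mean and variance of consecutive windows of the load series. The sketch below is only an informal check under assumed data; the helper `window_stats` and the load values are hypothetical, and a formal stationarity test would be needed for a rigorous conclusion.

```python
from statistics import mean, pvariance

def window_stats(series, window):
    """Return (mean, variance) for consecutive non-overlapping windows."""
    stats = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        stats.append((mean(chunk), pvariance(chunk)))
    return stats

# Hypothetical hourly load values (MW) with a rising level and growing swings.
load = [10, 12, 11, 13, 20, 26, 19, 27, 40, 52, 38, 54]
stats = window_stats(load, window=4)
```

If the windows report clearly different means and variances, as they do for this made-up series, the series is unlikely to be stationary.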


3.2 Traditional Load Forecasting Models 


Statistical-based models 


So far, various statistical techniques have been investigated for load forecasting,
with differing degrees of success. Conventional forecasting models include the
similar-day method, exponential smoothing, linear regression, multiple regression,
Autoregressive Moving-Average (ARMA), and Autoregressive Integrated Moving-Average
(ARIMA). Since the scope of this work is limited to deep learning-based methods, we
give only a short description of some statistical methods below.


— Similar-day method: It is one of the simplest methods for load forecasting because
it depends on searching for a similar day in the past. For instance, we search the
historical load data for days with similar characteristics and average them to
obtain the forecast for the target day. This method is a fast and easy way to
capture the overall load behavior; however, it does not account for grid expansion
and structural changes.

— Exponential smoothing: This approach smooths the time series using an
exponentially weighted moving average of the past load observations. It is robust
and accurate; however, it cannot accommodate more than one seasonal pattern. This
approach was applied to load forecasting in [8].

— Linear and multiple regressions: Regression is a statistical tool for estimating
the relationship between a dependent variable and independent variables. It helps
to understand how the dependent variable responds to changes in the independent
variables and which of them is most strongly related to the dependent variable.
Linear regression is a simple regression method that accommodates one dependent
variable and one independent variable and predicts the dependent variable as a
function of the independent variable. It finds the best-fitting straight line
through the points of these variables, called the regression line. Multiple
regression extends simple regression to one dependent variable and more than one
independent variable. The dependent variable could be the load measured over a
certain period of time, and the independent variables could be any influential
factors, e.g., the day of the week, temperature, or population size. Regression
approaches are easy to model and calculate; however, they are sensitive to outliers
and to the linearity assumption.

— ARMA and ARIMA: The autoregressive (AR) model and the moving average (MA) model
describe a time series in terms of its own history and predict its future values.
The AR model uses the relationship between an observation and its own lagged
values, and the MA model uses the residual errors of lagged observations. The more
advanced ARIMA model adds an integrated (I) step that differences the series,
subtracting the previous observation from the current one, to make the time series
stationary. The models are usually referred to by their orders, as ARMA(p, q) and
ARIMA(p, d, q). The lag order, which is the number of lagged observations included
in the model, is denoted as p. The degree of differencing, which is the number of
times the raw observations are differenced, is denoted as d. The moving average
order, which is the size of the moving average window, is denoted as q. The success
of these models therefore depends on the developer's experience and skill in
choosing the right orders. This approach was widely used for load forecasting,
especially for STLF, in [9, 10].
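
To make the statistical baselines above concrete, the following sketch implements three of their building blocks in plain Python: averaging similar days, simple exponential smoothing, and the differencing step behind the "I" in ARIMA. It is an illustration under assumptions rather than the implementation used in the cited works; the function names, the smoothing parameter `alpha`, and the way similar days are selected (left to the caller) are all hypothetical.

```python
def similar_day_forecast(history, similar_days):
    """Average the load profiles of past days judged similar to the target day.

    history maps a day label to its list of loads; similar_days lists the
    labels chosen as similar (the selection rule itself is not shown).
    """
    profiles = [history[d] for d in similar_days]
    return [sum(vals) / len(profiles) for vals in zip(*profiles)]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed  # the last value serves as the one-step-ahead forecast

def difference(series, d=1):
    """Apply first-order differencing d times (the integrated step of ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Hypothetical two-point daily profiles for two similar Mondays.
history = {"mon1": [10, 20], "mon2": [14, 24]}
forecast = similar_day_forecast(history, ["mon1", "mon2"])
smoothed = exponential_smoothing([10, 20, 30], alpha=0.5)
stationary = difference([1, 4, 9, 16], d=2)
```

Differencing twice turns the quadratic trend in the last example into a constant series, which is the effect the order d is meant to achieve.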


Machine learning-based models 


On a similar note, machine learning-based techniques are widely known for their
ability to accommodate complex systems, non-linear models, and non-stationary time
series. These advantages set machine learning-based models ahead of traditional
statistical models, which require prior knowledge of the influential factors and
modeling experience to achieve accurate load forecasting. Machine learning-based
techniques are self-learning methods that learn the mapping between input and
output data automatically. Besides, they need little prior experience with or
knowledge of the forecasting model to achieve accurate load forecasting. Findings
on machine learning-based techniques for load forecasting in the literature show
acceptable prediction errors; however, these attempts did not outperform deep
learning-based methods. Since this work concentrates on deep learning-based
techniques applied to load forecasting, we give only a short overview of some
machine learning-based techniques below, namely decision tree regression, support
vector regression, and artificial neural networks.


— Decision tree regression: The approach builds a predictive regression model by
partitioning the data into subsets that form a tree structure. It partitions the
data into smaller and smaller subsets while the decision tree grows incrementally.
The tree has decision nodes, which have two or more branches, and leaf nodes, which
hold one numerical target each. The branches represent the attribute values of the
observations. In decision tree regression, the target variable takes continuous
values. This approach was implemented for LTLF in [11] and for energy demand
modeling in buildings in [12].

— Support vector regression: It is a supervised learning method that represents the
data observations as points in space. In the classification setting, the mapped
points of different categories are separated by a hyperplane, and ideally the margin
around that hyperplane should be wide and clear; the regression variant instead fits
a function that keeps most observations within a margin around it. The method was
applied to load forecasting in [13-15].

— Artificial neural networks: The approach is one of the most widely used techniques
in machine learning. It is a brain-inspired method that mimics the process of human
self-learning. The architecture consists of one input layer, one hidden layer, and
one output layer; when it has more than one hidden layer, it is considered a deep
neural network, or deep learning. The connections between the artificial neurons
are called edges, and each edge carries a connection weight. Learning proceeds by
adjusting these weights, and each neuron applies a non-linear activation function.
This method was widely used for load forecasting in [16-20].
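
The partitioning step behind decision tree regression can be illustrated with a single split. The sketch below searches one feature for the threshold that minimizes the total squared error of the two resulting subsets; a full tree would apply the same search recursively. The squared-error criterion, the function names, and the temperature and load values are assumptions chosen for illustration and are not taken from [11] or [12].

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Find the threshold on one feature that minimizes the total SSE."""
    best = (None, float("inf"))
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical data: temperature (deg C) versus load (MW).
temperature = [18, 20, 22, 30, 32, 34]
load = [50, 52, 51, 80, 83, 81]
threshold, cost = best_split(temperature, load)
```

Each side of the chosen threshold would become a leaf whose numerical target is the mean load of its subset.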


Deep learning-based models 


Advanced machine learning techniques are called deep learning because they use
deeper neural networks that model more complex systems through multiple layers of
non-linear functions. The advantages of deep learning models over machine learning
models are more complex feature extraction, less manual modeling, and more accurate
predictions; however, their computational cost is higher than that of machine
learning and statistical models. Deep learning-based models have set accuracy
records in many important problems such as face detection, image processing,
recommender systems, natural language processing, and time series prediction. A few
efforts have applied deep learning-based models to load forecasting, for example,
the multilayer perceptron, convolutional neural networks, recurrent neural networks,
long short-term memory, and the gated recurrent unit, and produced more accurate
predictions; however, most of these attempts were based on conventional
implementations. Since this chapter concentrates on deep learning-based techniques
applied to load forecasting, we give a short review of some deep learning-based
methods and their applications and elaborate on their key conceptual and algorithmic
facets for load forecasting.


— Multilayer Perceptron: Subsequent work on artificial neural networks has shown
that they can have multiple hidden layers and apply many non-linear activation
functions. The resulting model is called the multilayer perceptron (MLP); it
includes one input layer, more than one hidden layer, and one output layer. The
approach is often used for supervised learning problems of classification and
prediction. Since it is a class of artificial neural networks, it has the same
learning properties, using connection weights and non-linear activation functions.
Training adjusts the weights and biases of the model to minimize the prediction
error. It consists of a forward pass that computes the prediction error and a
backward pass that computes the gradient of that error for gradient descent. The
backward pass is usually computed with the backpropagation algorithm, which finds
the partial derivatives of the prediction error with respect to the weights and
biases. In general, the MLP aims to learn a complex model by itself and minimize
prediction errors. The approach was employed in load forecasting [21] and in
electricity price forecasting [22, 23].
In the context of load forecasting, a multilayer perceptron is trained on input
temporal data X(t) in order to predict a target load L(t). X(t) can be any
univariate or multivariate time series that includes influential factors plus
historical load data. L(t) can be a temporal shift of the univariate or
multivariate time series. At time step t, the input layer processes the features
of $X(t) \in \mathbb{R}$; hence, the temporal vector is as follows:

$$X = \{X_0, X_1, \ldots, X_T\} \tag{1}$$

where X(t) is the historical data at time t and $t \in \{1, 2, \ldots, T\}$. The
load forecasting output L(t) is computed as follows:

$$L(t) = f(W \times X(t) + b), \tag{2}$$

where f(.) denotes the activation function, usually implemented by a sigmoid
function, a hyperbolic tangent, or a rectified linear unit. W and b are the weight
matrix and the bias vector, respectively.

— Convolutional neural networks: Convolutional neural networks (CNN) are similar to
the MLP in having one input layer, more than one hidden layer, and one output layer.
However, the hidden layers in this approach are convolutional layers that apply a
cross-correlation computation to the input neurons. The approach is widely applied
in applications including image recognition, video recognition, recommender systems,
and natural language processing. It is commonly used for data with a grid topology,
such as the two-dimensional grid of pixels that makes up image data [24]. Time
series data, however, form a one-dimensional grid sampled at a time interval. Thus,
load forecasting applications use a one-dimensional CNN, in which the mathematical
convolution operation is employed in at least one of the hidden layers [24]. This
approach was used for load forecasting in [25-28]. The one-dimensional convolutional
neural network is described with the cross-correlation function, or sliding dot
product, as follows [24]:

$$S(t) = (X * W)(t) = \sum_{a} X(a)\, W(t - a), \tag{3}$$

$$L(t) = f(W_L \times S(t) + b_L), \tag{4}$$

where X denotes the inputs, W is the weighting function (kernel filter), a is the
index over which the weighted average is taken, and S is the convolutional output,
called the feature map, at time t. L(t) denotes the load forecasting outputs, f(.)
denotes the activation function, $W_L$ denotes the hidden-to-output weights, and
$b_L$ is the hidden-to-output bias vector.

— Recurrent neural networks: Recurrent neural networks (RNN) are another special
type of neural network with one input layer, more than one hidden layer, and one
output layer. However, these hidden layers have recurrent connections that make
them suitable for sequential computation. The recurrent connections run from the
output of the hidden layer back to its input. The RNN is commonly applied to time
series because its architecture holds a memory state that helps it process
sequential data. The approach was used for multiple load forecasting problems, for
example in [29-31]. The mathematical representation of the RNN is defined as
follows:

$$h(t) = f\big((W_h \times h(t-1) + b_h) + (W_X \times X(t) + b_X)\big), \tag{5}$$

$$L(t) = f(W_L \times h(t) + b_L), \tag{6}$$

where f(.) denotes the activation function, X(t) denotes the inputs, L(t) denotes
the load forecasting outputs, h(t) denotes the hidden state, h(t - 1) denotes the
previous hidden state, $W_X$ denotes the input-to-hidden weights, $W_h$ denotes the
hidden-to-hidden weights, $W_L$ denotes the hidden-to-output weights, $b_h$ is the
hidden-to-hidden bias vector, $b_X$ is the input-to-hidden bias vector, and $b_L$
is the hidden-to-output bias vector.

— Long short-term memory: Long short-term memory (LSTM) works essentially in the
same way as the RNN, but its recurrent neurons employ additional gates, called the
forget gate, the update (input) gate, and the output gate, and an additional
internal processing unit called the cell. Each gate has a specific function in the
cell. The forget gate discards unwanted information from the previous state, the
update gate updates the state with new candidates, the cell filters the current
state to separate wanted from unwanted information, and the output gate selects the
necessary information from the cell output. This approach has received attention
due to its superior modeling accuracy. It was used widely in load forecasting in
[25, 32-35]. Whereas the RNN employs only one non-linear function, the LSTM applies
five non-linear functions within the same cell. In the context of load forecasting,
the mathematical representation is defined as follows:


$$i_t = g_1(W_{i,n} \times X(t) + W_{i,m} \times L(t-1) + b_i), \tag{7}$$
$$f_t = g_1(W_{f,n} \times X(t) + W_{f,m} \times L(t-1) + b_f), \tag{8}$$
$$o_t = g_1(W_{o,n} \times X(t) + W_{o,m} \times L(t-1) + b_o), \tag{9}$$
$$U = g_2(W_{U,n} \times X(t) + W_{U,m} \times L(t-1) + b_U), \tag{10}$$
$$C(t) = f_t \times C(t-1) + i_t \times U, \tag{11}$$
$$L(t) = o_t \times g_2(C(t)), \tag{12}$$

where $g_1$ denotes the sigmoid activation function, $g_2$ denotes the hyperbolic
tangent activation function, X(t) is the input vector, $i_t$ is the input gate
signal (i for input), $f_t$ is the forget gate signal (f for forget), $o_t$ is the
output gate signal (o for output), U is the update signal, C(t) is the state value
at time t of the computation, and L(t) is the output of the cell for load
forecasting. $W_{(\cdot)}$ and $b_{(\cdot)}$ are the weight matrices and bias
vectors, respectively. The weights that apply to the current input of a particular
variable are denoted as $W_{(\cdot),n}$, and those that apply to the previous state
signal as $W_{(\cdot),m}$.

— Gated recurrent unit: The gated recurrent unit (GRU) is another recent and popular
gated architecture of the RNN that adaptively captures the dependencies and features
of time series. It also mitigates the vanishing gradient problem. The main
difference between this approach and the LSTM is that it has only an update gate
and a reset gate. The update gate $z_t$ combines the forget gate and the input gate
of the LSTM to control the unwanted and wanted information. The reset gate $r_t$
determines how much of the previous state is combined with the next processed
input. This approach outperformed the LSTM in [36]. It was also used for load
forecasting problems in [37, 38]. The mathematical representation in the load
forecasting context is defined as follows:


$$z_t = g_1(W_{z,n} \times X(t) + W_{z,m} \times L(t-1) + b_z), \tag{13}$$
$$r_t = g_1(W_{r,n} \times X(t) + W_{r,m} \times L(t-1) + b_r), \tag{14}$$
$$U = g_2(W_{U,n} \times X(t) + W_{U,m} \times [r_t \odot L(t-1)] + b_U), \tag{16}$$
$$L(t) = (1 - z_t) \odot h(t-1) + z_t \odot U, \tag{17}$$


where $g_1$ denotes the sigmoid activation function, $g_2$ denotes the hyperbolic
tangent activation function, X(t) is the input vector, L(t) is the output vector of
load forecasting, U is the update signal, and $\odot$ is element-wise
multiplication. In a GRU the previous hidden state h(t - 1) coincides with the
previous output L(t - 1). $W_{(\cdot)}$ and $b_{(\cdot)}$ are the weight matrices
and bias vectors, respectively. The weights that apply to the current input of a
particular variable are denoted as $W_{(\cdot),n}$, and those that apply to the
previous state signal as $W_{(\cdot),m}$.
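
The forward computations of the MLP layer in Eq. (2), the one-dimensional convolution in Eq. (3), and the recurrent update in Eq. (5) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any cited work; the helper names (`matvec`, `dense`, `conv1d`, `rnn_step`), the default tanh activation, and the restriction of the convolution to positions where the kernel fully overlaps the input are assumptions.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def dense(W, b, x, f=math.tanh):
    """Eq. (2): L(t) = f(W x X(t) + b)."""
    return [f(a) for a in add(matvec(W, x), b)]

def conv1d(x, w):
    """Eq. (3), S(t) = sum_a X(a) W(t - a), for fully overlapping positions."""
    k = len(w)
    return [sum(x[a] * w[t - a] for a in range(t - k + 1, t + 1))
            for t in range(k - 1, len(x))]

def rnn_step(W_h, b_h, W_x, b_x, h_prev, x, f=math.tanh):
    """Eq. (5): h(t) = f((W_h h(t-1) + b_h) + (W_X X(t) + b_X))."""
    pre = add(add(matvec(W_h, h_prev), b_h), add(matvec(W_x, x), b_x))
    return [f(a) for a in pre]

identity = lambda a: a  # used here only to make the arithmetic easy to follow
layer_out = dense([[1, 2], [0, 1]], [0.5, -1], [1, 1], f=identity)
feature_map = conv1d([1, 2, 3, 4], [1, 0, -1])
h = rnn_step([[0.5]], [0.1], [[2.0]], [0.2], [1.0], [3.0], f=identity)
```

Applying `dense` to the final hidden state with the hidden-to-output weights gives the forecast of Eq. (6); in practice the weights would be learned by backpropagation rather than set by hand.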

Since these approaches are subcategories of the RNN, they are appropriate tools for
sequential problems such as time series prediction and load forecasting.
Additionally, they mitigate the vanishing gradient problem of the RNN, which
otherwise biases the model toward recent observations.
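
A single time step of the two gated cells can be written directly from Eqs. (7)-(12) and Eqs. (13)-(17). The sketch below assumes scalar inputs and states so that every weight is a plain number; this simplification, the parameter layout (a dictionary mapping a gate name to its two weights and bias), and the function names are assumptions made for illustration only.

```python
import math

def g1(a):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))

g2 = math.tanh  # hyperbolic tangent activation

def lstm_step(p, x, L_prev, C_prev):
    """One LSTM step following Eqs. (7)-(12) for scalar input and state.

    p maps a gate name ('i', 'f', 'o', 'U') to (W_n, W_m, b).
    """
    def gate(name, g):
        W_n, W_m, b = p[name]
        return g(W_n * x + W_m * L_prev + b)

    i_t, f_t, o_t = gate("i", g1), gate("f", g1), gate("o", g1)
    U = gate("U", g2)
    C = f_t * C_prev + i_t * U   # Eq. (11): new cell state
    L = o_t * g2(C)              # Eq. (12): cell output
    return L, C

def gru_step(p, x, L_prev):
    """One GRU step following Eqs. (13)-(17) for scalar input and state."""
    W_n, W_m, b = p["z"]
    z_t = g1(W_n * x + W_m * L_prev + b)        # Eq. (13): update gate
    W_n, W_m, b = p["r"]
    r_t = g1(W_n * x + W_m * L_prev + b)        # Eq. (14): reset gate
    W_n, W_m, b = p["U"]
    U = g2(W_n * x + W_m * (r_t * L_prev) + b)  # Eq. (16): update signal
    return (1.0 - z_t) * L_prev + z_t * U       # Eq. (17): new output

# With all-zero parameters every gate equals 0.5 and the update signal is 0.
zero_lstm = {name: (0.0, 0.0, 0.0) for name in "ifoU"}
zero_gru = {name: (0.0, 0.0, 0.0) for name in "zrU"}
L_lstm, C_lstm = lstm_step(zero_lstm, x=1.0, L_prev=0.0, C_prev=2.0)
L_gru = gru_step(zero_gru, x=1.0, L_prev=0.8)
```

The zero-parameter case is only a sanity check: the LSTM halves its cell state and the GRU halves its previous output, which matches what the equations predict when every gate sits at 0.5.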


3.3 Traditional Load Forecasting Models 


Modeling a deep learning-based paradigm for load forecasting is not an easy task.
A large number of choices and parameters have to be set adequately to achieve
appropriate modeling and accurate predictions. However, a few guidelines can help
developers overcome these challenges. We go over these guidelines in the next
section; here we classify the challenges as follows:


— Data scale: The scale of the historical load data is a major influential factor
in deep learning-based modeling. It can influence the model predictions for either
of the following reasons:


• Outliers and missing values: If the dataset is small, even a few outliers or
missing values will significantly alter the model.

• Train and test data: To evaluate the model properly, the data is split into
training and testing sets. Each set needs enough observations for proper training
and testing. If the original dataset is small, neither set may have enough
observations.


— Data preprocessing: Usually, the data preprocessing is an important step that has to 
be conducted before the data is ready as an input to the deep learning-based model. 
This step helps to manage the forecasting model problems and avoid excessive 
volatility of the data. 

— Designing the deep learning-based model: Selecting an appropriate deep learning 
method is the first step in designing an adequate model. There are many architec- 
tures of deep learning techniques that were utilized for time series predictions and 
electricity load forecasting. 

— The appropriate number of hidden layers and neurons: Determining the number of
hidden layers and neurons may be the most challenging task in designing deep
learning-based models. The issue arises because the model needs to fit the data
well at a low computational cost. Too few hidden layers and neurons may leave the
model too inflexible to fit the data. On the other hand, too many hidden layers and
neurons increase the chance of overfitting the model to the data. Besides, larger
hidden layers and more neurons increase the computational complexity.

— Model overfitting and validation: Overfitting arises when the model learns the
details of the training data well but performs poorly when it is tested on new,
unseen testing data. Although examining the model for overfitting is a good
strategy for identifying a good forecasting model, validating the model with other
tests is also necessary to obtain an adequate deep learning-based model for load
forecasting.

— Offline modeling: Generally, deep learning-based models are designed for offline
training and testing. This technique learns from the entire dataset at once and
evaluates the model on a held-out portion, the testing data. Online modeling, on
the other hand, is dynamic: it learns from the brand-new data at each time step and
updates the predictive model according to the latest data.
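
The train and test split mentioned in the list above can be sketched as follows. The text does not say how the split is made; keeping the observations in time order, so that the test set is the most recent portion, is an assumption that is common for time series, and the function name and the default 20% test fraction are hypothetical.

```python
def chronological_split(series, test_fraction=0.2):
    """Split a time-ordered series into train and test parts without shuffling."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, int(round(len(series) * test_fraction)))
    return series[:-n_test], series[-n_test:]

# Ten hypothetical observations; the last two are held out for testing.
train, test = chronological_split(list(range(10)), test_fraction=0.2)
```

With a small dataset, the same call shows the problem described above: a series of five observations leaves only one point for testing.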


4 Solutions and Recommendations 


Since most previous research in load forecasting focuses on small historical
datasets and uses conventional modeling approaches, there is little emphasis on
using big data, on hybridizing different significant deep learning-based models, or
on finding optimal deep learning parameters with solutions such as evolutionary
computation algorithms. An initial analysis found evidence that deep learning
methods are powerful techniques for precise time series prediction. We hypothesize
that hybridizing two or more deep learning-based methods for load forecasting in
smart grids could produce more accurate predictions and form the groundwork for
broadly applicable load forecasting models. Besides, finding optimal deep learning
parameters using different evolutionary computation methods could form the
preliminary search space of deep learning parameters.

In this section, we present some guidelines and solutions for the issues of
modeling deep learning-based paradigms. We elaborate on a case study of different
promising load forecasting models that use hybrid deep learning methods and compare
their results with existing load forecasting approaches.


4.1 Guidelines and Solutions to Modeling Issues 


Since deep learning-based models are sensitive to a large number of choices and
parameters, finding the appropriate model is the point at issue. A key element in
finding an adequate model is therefore to follow some of the guidelines and
successful solutions that have been used in the literature. A few guidelines and
recommendations can help model designers reach their goal quickly, but not every
guideline will work efficiently with every load forecasting model. These guidelines
and recommendations are listed according to the modeling issues mentioned earlier:


— Data scale: Because of the big data of electrical loads and prices now available
in smart grids, load forecasting designers need new approaches and technologies to
achieve their goals. In general, big data, or a large-scale dataset, has a large
volume of information and complex data structures that cannot be processed with
traditional load forecasting models. A large-scale dataset helps utility providers
and energy management operators analyze their systems comprehensively. Besides,
they can design their forecasting models with computationally intensive techniques
such as deep learning models, which perform well with big data by training in
batches. The primary objective of this chapter is to design a deep learning-based
forecasting model using a large-scale dataset.

— Data preprocessing: Data preprocessing refers to all processing applied to the
raw data before it is fed to the deep learning model. Some preprocessing techniques
must be applied before the model learns from the dataset. They help the model
perform better and consume less computation time. We list some of the common data
preprocessing techniques used for deep learning methods below:


• Data cleaning: Since deep learning-based models are sensitive to defective
samples in the dataset, data cleaning is essential for better deep learning
performance. The technique may include removing or fixing missing data and
outliers.

• Data normalization: Normalizing the dataset features prevents attributes with
large numeric ranges from dominating and helps the model perform accurately.
Because most electrical load datasets combine different value scales and
quantities, for example, load profiles, weather data, and fuel prices, normalizing
these values before feeding them to the deep learning model makes learning easier
and reduces the computation cost. The mathematical representation of data
normalization is as follows:

$$X'(t) = \frac{X(t) - \min}{\max - \min} \tag{17}$$

where X(t) is the original value of the input dataset, X'(t) is the normalized
value scaled to the range [0, 1], max is the maximum value of the feature, and
min is the minimum value of the feature.


— Designing the deep learning-based model: Selecting an appropriate deep learning
architecture is the first step for the load forecasting model. Since various
research papers have applied deep learning methods to load forecasting, reviewing
these papers, analyzing their techniques, and comparing their results is an
important step in finding the best technique. Sometimes a technique that performs
well on one dataset does not perform the same way on another. Thus, finding the
best model is a challenging task for load forecasting modelers. In most of the
reviewed papers, the authors made their choices based on trials and empirical
tests.


• The appropriate number of hidden layers and neurons: In general, the number of
hidden layers is less complicated to choose and less influential on the deep
learning model's performance than the number of hidden neurons in each hidden
layer. A single hidden layer computes a weighted sum of its inputs and passes it
through a non-linear activation function. An extra hidden layer can smooth and
approximate the mapping features of the previous hidden layer, resulting in a
better prediction output, but not in every case. The number of hidden neurons is
more influential on the final prediction output than the number of hidden layers.
A large number of neurons leads to overfitting, and a small number of neurons
leads to underfitting. There is no rule for choosing the number of hidden layers
and the number of hidden neurons, but there are some techniques such as trial and
error, pruning, and evolutionary computation algorithms. Evolutionary computation
algorithms can evolve the deep learning-based model and optimize the number of
hidden neurons, for example, the genetic algorithms in [39, 40].


— Model overfitting and validation: To improve model validation and avoid
overfitting, applying a cross-validation technique is an essential task after
modeling. Cross-validation splits the dataset into k folds to estimate the general
performance of the prediction model and gives an insight into how the model
generalizes over the independent variables throughout the dataset. The method
repeats the process of splitting the dataset into training and testing portions k
times, where the size of the testing portion remains fixed but its position moves
through the original dataset; the remainder is used as the training dataset in each
fold.

— Online modeling: Applying a parallel computational technique, such as MapReduce, 
on a computational framework such as Apache Hadoop or Apache Spark can reduce the 
running time of the deep learning-based model. Parallel computation can also 
provide a real-time prediction paradigm, for example, real-time power forecasting, 
in which the model is trained offline on historical input variables and updated 
and tested online on current input variables. 
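The k-fold procedure described in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's implementation: the function name `k_fold_indices` is hypothetical, and the folds are assumed to be contiguous blocks so that the test window moves through the dataset as described.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        # Everything outside the test window is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Averaging the error measured on the five test windows would give the cross-validated estimate of the model's general performance.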


To model load forecasting efficiently using big data and deep learning methods, 
four crucial steps have to be implemented to obtain an accurate prediction. The 
first step is determining the big data, which may include load data and other 
influential factors. The second is data preprocessing, which consists of data 
cleaning, normalization and splitting the data into training and testing sets. 
The third is selecting a deep learning method that is suitable for the forecasting 
problem and designing the deep learning model properly. Finally, the forecasting 
results may be visualized using visualization tools to evaluate the model 
performance and prediction. If the prediction errors are not reasonable, the deep 
learning model can be modified or changed to improve the prediction accuracy. 
Besides, the model can be compared with other deep learning methods. Figure 4 
demonstrates the general modeling procedure of load forecasting. 

Fig. 4 An overview of the deep learning-based modeling procedure. The procedure consists of four 
main segments. The first segment is selecting an electricity load dataset, which may include 
influential factors data. The second segment is the data preprocessing using a CPU-based computer, 
which means that the preprocessing does not need high computation processing. The third segment is 
the predictive model, which is one of the deep learning algorithms. This step needs a 
high-performance computational tool such as a GPU-based computer. The last step shows the 
forecasting results and prediction errors. This step needs visualization tools to visualize the 
outputs such as prediction graphs, training and testing performances, and comparative charts 


4.2 Case Study 


In this case study, we consider a big dataset of power consumption in a 
commercial building. First, we apply some analysis and preprocessing techniques 
in order to understand the nature of the time series dataset and make it ready for 
the forecasting model. Then, we set up a hybrid deep learning-based model for 
STLF with hour-ahead, day-ahead and week-ahead forecasting. We compare the 
forecasting results with traditional statistical-based models, machine learning-based 
models and deep learning-based models. 


Commercial building data 


The large-scale dataset of power consumption in a commercial building is publicly 
available in [41]. The time series dataset covers the year 2010 at a fifteen-minute 
resolution. The dataset includes the power consumption in kW and the outdoor 
temperature in °F. The building chosen in this study is building 1, which is a retail 
building in Fremont, CA. Figure 5 shows the variation of average power consumption 
for 2010. 


Hybrid deep learning-based models 


Referring to the modeling procedure shown in Fig. 4, our modeling consists of 
four main parts, including the preprocessing and the hybrid deep learning-based model. 
The data preprocessing segment prepares the input features collected in the dataset 
for the hybrid deep learning-based model. 

Fig. 5 The load profile in kilowatts (kW) of the averaged daily power consumption of a commercial 
building for one year 

There are three main steps in the preprocessing segment: the first is 
normalizing the original dataset as in (17), the second is preparing the input 
data for the supervised learning technique, and the third is splitting the normalized 
supervised dataset into three parts, the training, validation and testing datasets. 
To evaluate the performance of the proposed model accurately, the training data is 
used for the training process, the validation data is used to validate the model 
performance, and the testing data is used only for testing the forecasting process 
on unseen data. 
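The three preprocessing steps can be sketched as follows. This is an illustration under stated assumptions: Eq. (17) is not reproduced in this section, so min-max scaling is assumed for the normalization; the helper names, the four-lag window and the synthetic `load` series are hypothetical; and the 70/20/10 proportions are taken from the results discussion later in the chapter.

```python
import numpy as np


def min_max_normalize(series):
    """Scale a 1-D series to [0, 1] (assumed form of the normalization)."""
    lo, hi = float(np.min(series)), float(np.max(series))
    return (series - lo) / (hi - lo), (lo, hi)


def make_supervised(series, n_lags, horizon=1):
    """Build (X, y): n_lags past values predict the value `horizon` steps ahead."""
    n = len(series) - n_lags - horizon + 1
    X = np.stack([series[i:i + n_lags] for i in range(n)])
    y = np.array([series[i + n_lags + horizon - 1] for i in range(n)])
    return X, y


def chronological_split(X, y, train=0.7, val=0.2):
    """Split in time order so that the test portion stays unseen."""
    n_train = int(len(X) * train)
    n_val = int(len(X) * val)
    return ((X[:n_train], y[:n_train]),
            (X[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (X[n_train + n_val:], y[n_train + n_val:]))


# Hypothetical stand-in for the power consumption series (kW).
load = 200.0 + np.arange(100.0)
scaled, (lo, hi) = min_max_normalize(load)
X, y = make_supervised(scaled, n_lags=4)
train, val, test = chronological_split(X, y)
print(X.shape, len(train[0]), len(val[0]), len(test[0]))
```

In practice the minimum and maximum would usually be computed on the training portion only, so that no information from the validation and testing data leaks into the scaling.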

The hybrid deep learning-based model in the third step is based on an encoder and 
a decoder, which are the CNN model and the LSTM model, respectively. The input of the 
CNN-LSTM is the record of the power consumption dataset of the commercial building 
after the preprocessing analysis, and the output is the power consumption forecast 
for the next day and next week. It is unlike traditional CNN or LSTM models because 
it hybridizes these two superior methods to improve the learning process. The first 
half is the CNN, which is used to extract the input features and encode them as in (5) 
and (6), and the second half is the LSTM, which is used to analyze the extracted 
features from the CNN as in (7)-(12) and decode them to predict the power 
consumption for the next period of time. The approach includes two one-dimensional 
convolution layers to improve the extraction of the input features, one 
one-dimensional pooling layer to collect the extracted features, and two LSTM layers 
to analyze the collected features and predict the output, as shown in Fig. 6. 

This CNN-LSTM model is implemented using Python 2.7, the Keras deep learning 
framework [42], and the scikit-learn framework [43]. We configured the model 
network with the same parameters and activation functions shown in Fig. 6. Because 
the CNN model offers only a few choices for the number of neurons, we selected 
64 neurons in the first convolution layer, 32 neurons in the second convolution 
layer and 16 neurons for the pooling layer. The decoder segment then reverses the 
number of hidden neurons, with 32 and 64 hidden neurons. The optimizer is Adam, and 
the loss function is the mean squared error. The total number of training epochs 
is 1000. 
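A rough Keras sketch of such an encoder-decoder stack is given below. It is not the authors' code: it targets Python 3 and `tensorflow.keras` rather than Python 2.7, and since Fig. 6 is not reproduced here, the kernel size, pool size, ReLU activations and the final dense output layer are assumptions. Only the layer sizes 64/32 (convolutions) and 32/64 (LSTMs), the Adam optimizer and the mean squared error loss come from the text; the 16 units the text attributes to the pooling layer have no direct counterpart in a Keras pooling layer and are left out.

```python
# Layer sizes taken from the text; all other hyperparameters are assumptions.
CONV_FILTERS = (64, 32)   # two one-dimensional convolution layers (encoder)
LSTM_UNITS = (32, 64)     # two LSTM layers (decoder)


def build_cnn_lstm(n_steps, n_features=1, horizon=1, kernel_size=3, pool_size=2):
    """Return a compiled Keras model; TensorFlow must be installed to call this."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_steps, n_features)))
    for filters in CONV_FILTERS:
        model.add(layers.Conv1D(filters, kernel_size, activation="relu"))
    model.add(layers.MaxPooling1D(pool_size))
    # The first LSTM returns the full sequence so the second LSTM can consume it.
    model.add(layers.LSTM(LSTM_UNITS[0], return_sequences=True))
    model.add(layers.LSTM(LSTM_UNITS[1]))
    model.add(layers.Dense(horizon))
    model.compile(optimizer="adam", loss="mse")
    return model


def conv_pool_output_steps(n_steps, kernel_size=3, pool_size=2):
    """Time steps left after the unpadded convolutions and the pooling layer."""
    for _ in CONV_FILTERS:
        n_steps = n_steps - kernel_size + 1
    return n_steps // pool_size


# With a 24-step input window, the LSTM layers see a shortened sequence.
print(conv_pool_output_steps(24))
```

A call such as `build_cnn_lstm(n_steps=24).fit(X_train, y_train, epochs=1000, validation_data=(X_val, y_val))` would then follow the training setup described above, with `X_train` shaped as (samples, 24, 1).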


Results and discussions 


To evaluate the forecasting performance, we used 70% of the original dataset 
to train the model, 20% to validate the performance of the model and the last, 
unseen 10% to test the model predictions. The conventional metrics used to evaluate 
predictive models are used to evaluate the forecasting in our experiments. These 
metrics, the root-mean-squared error (RMSE), the coefficient of variation of 
the RMSE, known as the normalized root-mean-squared error (NRMSE), and the mean 
absolute percentage error (MAPE), are defined as follows: 


\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \bigl( L(t) - X(t) \bigr)^{2}}, (18) 

where T represents the total number of time steps in the time series dataset, L(t) is 
the predicted output of the time series, X(t) is the real measured time series in the 
dataset, and \bar{X} is the average of the actual values of power consumption. 

Fig. 7 The energy consumption forecasting results of the CNN-LSTM model. The forecasting 
curves of one hour-ahead are represented by dashed lines that follow the original load profile 
curves 
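The three metrics named above can be computed as in the sketch below. Only the RMSE equation survives in this section, so the NRMSE and MAPE formulas used here are the commonly used forms (RMSE divided by the mean of the actual values, and the mean absolute relative error, both in percent) and should be read as assumptions; the sample values are invented for illustration.

```python
import numpy as np


def rmse(pred, actual):
    """Root-mean-squared error between prediction L(t) and measurement X(t)."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


def nrmse(pred, actual):
    """RMSE as a percentage of the mean actual value (coefficient of variation)."""
    return 100.0 * rmse(pred, actual) / float(np.mean(actual))


def mape(pred, actual):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100.0 * float(np.mean(np.abs((actual - pred) / actual)))


# Invented values in kW, for illustration only.
actual = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 310.0, 390.0])
print(rmse(pred, actual), nrmse(pred, actual), mape(pred, actual))
```

Because every prediction here is off by exactly 10 kW, the RMSE is 10 kW, while MAPE weights the same absolute error more heavily at low loads.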

As shown in Fig. 7, the forecasting result is drawn as dashed curves with triangles 
and the original data as line curves. It is worth noting that the forecasting curves 
are almost consistent with the original curves except for several abrupt deviation 
points. This demonstrates the effectiveness of the CNN-LSTM forecasting model. 

Applying the cross-validation method to the CNN-LSTM model produces a robust 
averaged estimate of the forecasting error, because the observations in the dataset 
are used for both training and testing across the folds. We used 10-fold 
cross-validation in our forecasting model with a time series cross-validator [43]. 
By applying this method, we reduced the risk of overfitting in our model and 
validated the prediction model by testing on unseen data at each fold. Besides, we 
compared our model with traditional statistical, machine learning and deep learning 
models, as in Table 2. It is worth noting that the best forecasting performance was 
achieved by the CNN-LSTM for both one hour-ahead and one day-ahead forecasting. 
Also, the one-dimensional CNN model performed better than the other models. The 
LSTM and GRU models performed similarly for both forecasting horizons, with the GRU 
slightly better than the LSTM in both. The decision tree model had the worst 
forecasting performance in our case study. The CNN-LSTM model therefore showed its 
superiority in forecasting, since it hybridizes two successful deep learning-based 
models. 

Table 2 Comparison of hour-ahead and day-ahead forecasting between traditional models and 
the hybrid deep learning-based model. The prediction errors are percentages of NRMSE and MAPE 
and represent the average cross-validation error 

Model            An hour-ahead forecasting    A day-ahead forecasting 

                 NRMSE (%)    MAPE (%) 
ARIMA            9.979        4.644 
Decision tree    14.410       5.054 
KNN              11.242       4.465 
MLP              9.017        3.966 
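For time series, the folds have to respect temporal order, so that each model is tested only on data later than its training data. The sketch below shows one expanding-window scheme of the kind that scikit-learn's `TimeSeriesSplit` provides; it is written in plain Python for illustration, the function name is hypothetical, and it is an assumption that the chapter's cross-validator behaved this way.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) with a growing training set and a
    fixed-size test window that always lies after the training data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too many splits for the number of samples")
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))


splits = list(expanding_window_splits(n_samples=22, n_splits=10))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Averaging NRMSE and MAPE over the ten test windows would give cross-validated errors of the kind reported in Table 2.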


5 Conclusions and Future Trends 


The infrastructure of the energy market has changed dramatically in recent years. 
With the development of smart technologies implemented in the grid and the 
introduction of renewable and distributed energy resources, energy market 
participants need to update their methodologies for planning, operating, 
and controlling electrical loads and energy consumption. This chapter focused on 
the deep learning application to load forecasting in smart grids; thus, we 
gave a snapshot of the background of smart grids and electrical load patterns. We 
discussed the importance of load forecasting in the energy market and the factors 
influencing load forecasting modeling. 

In this overview, we reviewed traditional load forecasting methods such as 
statistical methods, machine learning methods, and deep learning methods. We explored 
the key conceptual and algorithmic facets of deep learning methods applied to load 
forecasting, and discussed the general issues of deep learning modeling. Also, we 
performed a case study of big data and a hybrid deep learning-based model for 
commercial building load forecasting. We found that the CNN-LSTM model outperformed 
the other traditional and deep learning models. 

From the literature review, we conclude that no specific deep learning model 
outperforms other deep learning models for every forecasting problem. Thus, the 
best architecture choice depends on the forecasting task and challenges. The LSTM 
and GRU models are close to each other in their performance because they are 
both variants of the RNN and suitable for sequential problems. They usually achieve 
accurate predictions compared with traditional models. As a suggestion, hybridizing 
two superior models from the literature is a promising technique for achieving good 
performance in load forecasting modeling, as our results in the case study show. 

Reinforcement learning is a possible future research subject of load forecasting 
and energy management. This method can allow the energy model to adjust the 
parameters and reduce energy consumption automatically. Furthermore, by integrat- 
ing big data that includes many electrical load features, not only traditional aggregated 
load and energy consumption can be forecasted, but a comprehensive interactive pre- 
diction of many energy systems such as operational energy parameters and electric 
vehicle charging can be achieved. Besides, the implementation of intelligent predic- 
tive systems in actual systems can increase the smart grid development and enhance 
practical future intelligent applications. 


References 


1. Gungor, V.C., et al.: Smart grid technologies: communication technologies and standards. IEEE 
Trans. Ind. Inf. 7(4), 529-539 (2011) 

2. Deng, R., Yang, Z., Chow, M., Chen, J.: A survey on demand response in smart grids: mathe- 
matical models and approaches. IEEE Trans. Ind. Inf. 11(3), 570-582 (2015) 

3. Almalaq, A., Edwards, G.: A review of deep learning methods applied on load forecasting. In: 
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 
pp. 511-516 (2017) 

4. Raza, M.Q., Khosravi, A.: A review on artificial intelligence based load demand forecasting 
techniques for smart grid and buildings. Renew. Sustain. Energy Rev. 50, 1352-1372 (2015) 

5. Khatoon, S., Ibraheem, Singh, A.K., Priti: Effects of various factors on electric load forecasting: 
an overview. In: 2014 6th IEEE Power India International Conference (PIICON), pp. 1-5 (2014) 

6. Fahad, M.U., Arbab, N.: Factor affecting short term load forecasting. J. Clean Energy Technol. 
2(4), 305-309 (2014) 

7. Feinberg, E.A., Genethliou, D.: Load Forecasting. In: Chow, J.H., Wu, F.F., Momoh, J. (eds.) 
Applied Mathematics for Restructured Electric Power Systems: Optimization, Control, and 
Computational Intelligence, pp. 269-285. Springer US, Boston, MA (2005) 

8. Ji, P., Xiong, D., Wang, P., Chen, J.: A study on exponential smoothing model for load 
forecasting. In: 2012 Asia-Pacific Power and Energy Engineering Conference, pp. 1-4 (2012) 

9. Amjady, N.: Short-term hourly load forecasting using time-series modeling with peak load 
estimation capability. IEEE Trans. Power Syst. 16(3), 498-505 (2001) 

10. Hagan, M.T., Behr, S.M.: The time series approach to short term load forecasting. IEEE Trans. 
Power Syst. 2(3), 785-791 (1987) 

11. Ding, Q.: Long-term load forecast using decision tree method. In: 2006 IEEE PES Power 
Systems Conference and Exposition, pp. 1541-1543 (2006) 

12. Yu, Z., Haghighat, F., Fung, B.C.M., Yoshino, H.: A decision tree method for building energy 
demand modeling. Energy Build. 42(10), 1637-1646 (2010) 

13. Chen, B.-J., Chang, M.-W., et al.: Load forecasting using support vector machines: a study on 
EUNITE competition 2001. IEEE Trans. Power Syst. 19(4), 1821-1830 (2004) 

14. Pai, P.-F., Hong, W.-C.: Support vector machines with simulated annealing algorithms in 
electricity load forecasting. Energy Convers. Manag. 46(17), 2669-2688 (2005) 

15. Zhu, Z., Sun, Y., Li, H.: Hybrid of EMD and SVMs for short-term load forecasting. In: 2007. 
ICCA 2007. IEEE International Conference on Control and Automation, pp. 1044-1047 (2007) 

16. Park, D.C., El-Sharkawi, M.A., Marks, R.J., Atlas, L.E., Damborg, M.J.: Electric load fore- 
casting using an artificial neural network. IEEE Trans. Power Syst. 6(2), 442-449 (1991) 


17. Hayati, M., Shirvany, Y.: Artificial neural network approach for short term load forecasting for 
Illam region. World Acad. Sci. Eng. Technol. 28, 280-284 (2007) 

18. Kandil, N., Wamkeue, R., Saad, M., Georges, S.: An efficient approach for short term load 
forecasting using artificial neural networks. Int. J. Electr. Power Energy Syst. 28(8), 525-530 
(2006) 

19. Zhang, G., Patuwo, B.E., Hu, M.Y.: Forecasting with artificial neural networks: the state of the 
art. Int. J. Forecast. 14(1), 35-62 (1998) 

20. Gonzalez, P.A., Zamarreño, J.M.: Prediction of hourly energy consumption in buildings based 
on a feedback artificial neural network. Energy Build. 37(6), 595-601 (2005) 

21. Tsakoumis, A.C., Vladov, S.S., Mladenov, V.M.: Electric load forecasting with multilayer 
perceptron and Elman neural network. In: 6th Seminar on Neural Network Applications in 
Electrical Engineering, pp. 87-90 (2002) 

22. Dudek, G.: Multilayer perceptron for GEFCom2014 probabilistic electricity price forecasting. 
Int. J. Forecast. 32(3), 1057-1060 (2016) 

23. Kuo, P.-H., Huang, C.-J.: An electricity price forecasting model by hybrid structured deep 
neural networks. Sustainability 10(4), 1280 (2018) 

24. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016) 

25. Amarasinghe, K., Marino, D.L., Manic, M.: Deep neural networks for energy load forecasting. 
In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), pp. 1483-1488 
(2017) 

26. Khan, S., Javaid, N., Chand, A., Khan, A.B.M., Rashid, F., Afridi, I.U.: Electricity load 
forecasting for each day of week using deep CNN. In: Kalbitzer, U., Jack, K.M. (eds.) Primate 
Life Histories, Sex Roles, and Adaptability, pp. 1107-1119. Springer International Publishing, 
Cham (2019) 

27. Kollia, I., Kollias, S.: A deep learning approach for load demand forecasting of power systems. 
In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 
pp. 912-919 (2018) 

28. Dong, X., Qian, L., Huang, L.: A CNN based bagging learning approach to short-term 
load forecasting in smart grid. In: 2017 IEEE SmartWorld, Ubiquitous Intelligence 
Computing, Advanced Trusted Computed, Scalable Computing Communications, 
Cloud Big Data Computing, Internet of People and Smart City Innovation 
(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1-6 (2017) 

29. Shi, H., Xu, M., Li, R.: Deep learning for household load forecasting: a novel pooling deep 
RNN. IEEE Trans. Smart Grid 9(5), 5271-5280 (2018) 

30. Yu, Z., Niu, Z., Tang, W., Wu, Q.: Deep learning for daily peak load forecasting: a novel gated 
recurrent neural network combining dynamic time warping. IEEE Access 7, 17184-17194 
(2019) 

31. Bedi, J., Toshniwal, D.: Deep learning framework to forecast electricity demand. Appl. Energy 
238, 1312-1326 (2019) 

32. Kong, W., Dong, Z.Y., Hill, D.J., Luo, F., Xu, Y.: Short-term residential load forecasting based 
on resident behaviour learning. IEEE Trans. Power Syst. 33(1), 1087-1088 (2018) 

33. Marino, D.L., Amarasinghe, K., Manic, M.: Building energy load forecasting using deep neural 
networks. In: IECON 2016-42nd Annual Conference of the IEEE Industrial Electronics Society, 
pp. 7046-7051 (2016) 

34. Gan, D., Wang, Y., Zhang, N., Zhu, W.: Enhancing short-term probabilistic residential load 
forecasting with quantile long-short-term memory. J. Eng. 2017(14), 2622-2627 (2017) 

35. Zheng, J., Xu, C., Zhang, Z., Li, X.: Electric load forecasting in smart grids using 
long-short-term-memory based recurrent neural network. In: 2017 51st Annual Conference on 
Information Sciences and Systems (CISS), pp. 1-6 (2017) 

36. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural 
networks on sequence modeling. In: CoRR (2014). http://arxiv.org/abs/1412.3555 

37. Kumar, S., Hussain, L., Banarjee, S., Reza, M.: Energy load forecasting using deep learning 
approach-LSTM and GRU in spark cluster. In: 2018 Fifth International Conference on 
Emerging Applications of Information Technology (EAIT), pp. 1-4 (2018) 




38. Gao, X., Li, X., Zhao, B., Ji, W., Jing, X., He, Y.: Short-term electricity load forecasting model 
based on EMD-GRU with feature selection. Energies 12(6), 1140 (2019) 

39. Almalaq, A., Zhang, J.J.: Evolutionary deep learning-based energy consumption prediction for 
buildings. IEEE Access 7, 1520-1531 (2019) 

40. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Optimal deep learning LSTM model for elec- 
tric load forecasting using feature selection and genetic algorithm: comparison with machine 
learning approaches. Energies 11(7) (2018) 

41. Long-Term Energy Consumption & Outdoor Air Temperature For 11 Commercial Buildings- 
Openei Datasets. Openei.org (2019) 

42. Chollet, F. et al.: Keras. GitHub (2015) 

43. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 
2825-2830 (2011) 


Fast and Accurate Seismic Tomography 
via Deep Learning 


Mauricio Araya-Polo, Amir Adler, Stuart Farris and Joseph Jennings 


Abstract This chapter presents a novel convolutional neural network (CNN)-based 
approach to seismic tomography, which is widely used in velocity model building 
(VMB). VMB is a key step in geophysical exploration where a model of the sub- 
surface is needed, such as in hydrocarbon exploration for the Oil & Gas industry. 
The VMB main product is an initial model of the subsurface that is subsequently 
used in seismic imaging and interpretation workflows. Existing solutions rely on 
numerical solutions of wave equations, and require highly demanding computation 
and the resources of domain experts. In contrast, we propose and implement 
a novel 3D CNN solution that bypasses these demanding steps, directly producing 
an accurate subsurface model from recorded seismic data. The resulting predictive 
model maps relationships between the data space and the final earth model space. 
The subsurface models are reconstructed within seconds, namely, orders of magni- 
tude faster than existing solutions. Reconstructed models are free of human biases 
since no initial model or numerical technique tuning is required. This chapter is a 
significant extension of previously published material and provides a detailed 
explanation of the seismic tomography problem, and of the previously unpublished 3D 
CNN architecture, training workflows and comparisons to state-of-the-art. 


1 Introduction 


The main workflow of hydrocarbon exploration starts with acquiring field data, which 
consist of recordings of the response of the subsurface to artificial perturbations. Fol- 
lowing data acquisition, several disciplines [1], geology, geophysics, petrophysics, 
etc., combine efforts to produce a model of the earth (see Fig. 1, top right) which may 
or may not have clear indications of hydrocarbon presence. In areas such as the Gulf 
of Mexico, hydrocarbons tend to accumulate near salt bodies making them a key 
geological structure in earth model building [2]. This earth model is a critical part 
of the decision making process and is given utmost importance during exploration 
projects. The average success ratio of the industry is low, thus avoiding unnecessary 
expenses, such as drilling wells, translates into saving millions of dollars. Therefore, 
techniques to accelerate the decision time and increase the success ratio are crucial. 

What we propose in this chapter goes beyond what is currently making inroads in 
exploration geosciences, which is machine learning (ML) techniques being applied to 
specific well-known steps of the standard hydrocarbon exploration workflow (Fig. 1, 
red arrow). Most of the advances happen on the interpretation [3, 4] of the models 
rather than in the generation of them. In contrast, our method is an end-to-end 
solution, producing earth models directly from unmanipulated seismic data. Our method 
differs from current velocity building methods, seismic tomography [5] (similar to 
medical tomography but the penetrating wave is seismic) or wave equation-based 
modeling/inversion, in that our method is automatic and without human interven- 
tion. The deep learning (DL) technique employed follows recent work [6, 7] that 
demonstrates this new approach, which uses a deep neural network (DNN) statisti- 
cal model to transform raw input seismic data directly to the final mapping in 2D 
or 3D model space. The computational cost of the proposed approach is mostly due 
to the training phase, which occurs only once and offline. After training, velocity 
model reconstruction computational costs are negligible, thus making the overall 
computing requirements a fraction of those needed for traditional techniques, in 
particular the ones involving partial differential equations (PDE)-based simulations. 
As a preliminary step, velocity semblance [8] is used as the input feature space, 
which provides apparent seismic velocity (main attribute of an earth model) information 
for the training process. While we do perform feature extraction, rather than use the raw 
data, this feature extraction step is automated and not subject to human bias. Later, 
we extend the approach to work directly on the raw recordings thus freeing it from 
feature extraction and using the fully accepted unmanipulated seismic data as input. 

The main design concern relates to the generalization capability of the DL-based 
solution, that is, how accurately a trained model can predict from unseen data. 
To address that concern, we foresee models being trained with specific data 
belonging to different major exploration areas, such as pre-salt (offshore Brazil) 
or subsalt (Gulf of Mexico or West Africa). Regarding future hydrocarbon exploration 
workflows, one can imagine this technique being used just after data acquisition 
(field recording); trained models are then uploaded to the cloud, from which 
interpreters can access realizations and thus perform online scenario testing by 
feeding back their model modifications. This anticipated workflow is fully ML-based, 
flexible and places the domain experts at the center of the critical decision making 
process. Finally, we envision that this technique can also be applied to other 
tomography problems that arise in the geosciences, such as global seismology, 
shallow hazards, etc. 

Fig. 1 Overall vision of the new exploration geophysics workflow (green arrow), where the classical 
way of approaching the problem is depicted at the bottom, following the red arrow 

This chapter is organized as follows: Sect. 2 introduces the seismic tomography 
problem and the principles of seismic data acquisition. Section 3 explains the DL 
approach and the semblance geophysical feature. Section 4 presents experimental 
results with 2D synthetic seismic data. Section 5 compares our DL results against the 
state-of-the-art results obtained with the industry's tool of choice. Section 6 
introduces our preliminary results without extracting features from the data. Finally, 
conclusions and future research are provided in Sect. 7. 


2 The Seismic Tomography Problem 


To provide a complete context of the earth model building problem, before delving 
into our proposed DL-based solutions, this section explains the data to be used 
throughout the chapter and succinctly reviews the scientific problem at hand. 


2.1 Seismic Data 


Seismic data are acquired, for the onshore case, via sources positioned on the earth’s 
surface and arrays of receivers (geophones). In the offshore case, the sources and 
receivers (hydrophones) are towed by a ship, as illustrated in Fig. 2. 

Fig. 2 Offshore seismic data acquisition (Source: Houston Chronicle, BP, Schlumberger, Fairfield 
Nodal) 


When energy is emitted from the source, it propagates through a highly het- 
erogeneous medium (i.e., subsurface) which in turn creates reflections, refractions 
and diffraction effects that are recorded at the receiver (sensors) locations. As these 
recorded events are created due to changes in subsurface rock properties, inherently 
they contain information about the subsurface from whence they originated. The 
goal of seismic tomography and seismic imaging in general is to reconstruct the 
subsurface (earth model) that created the recorded seismic data. 

With only one source firing and a finite number of receivers, only a limited portion 
of the subsurface target of interest can be sampled. Therefore, in order to adequately 
illuminate the subsurface, the source and array of receivers must be positioned at 
multiple spatial locations. In reflection seismic terminology, the data obtained from 
the source firing at a single position x_{s_i} into an array of receivers x_{r_j}, 
j = 1, ..., N_r, where N_r is the total number of receivers recording during a 
source firing, is known as a “shot gather”. Modern reflection seismic data acquired for 
industrial purposes are composed of hundreds of thousands of shot gathers. Figure 3 
depicts the ray paths (discrete approximation of a wavefront) associated with a shot 
gather for a single layer subsurface model and synthetic data recorded as a result of 
finite-difference modeling with a point source located at the position x_{s_i}. Note 
that as the source moves along the surface with a dense array of receivers, subsurface 
points will be illuminated multiple times. To take advantage of this data redundancy, 
seismic data are typically transformed into midpoint and half-offset coordinates via 
the following relations 

x_{m_{ij}} = \frac{x_{s_i} + x_{r_j}}{2}, 

x_{h_{ij}} = \frac{x_{s_i} - x_{r_j}}{2}, 

where x_{m_{ij}} and x_{h_{ij}} are the midpoint and half-offset coordinates, 
respectively. Figure 4 shows the resulting raypaths and data that arise from sorting 
the data in Fig. 3 into the midpoint and half-offset domain. As this collection of 
records is for a fixed midpoint and several offsets, this type of data is known as a 
common-midpoint gather. The processing of seismic data for velocity model building is 
in general performed with the data transformed into the midpoint and half-offset 
coordinates. 

Fig. 3 (left) Raypaths for a seismic shot gather acquired over a flat layer earth. (right) Simulated 
data using finite-difference modeling. The linear event corresponds to the wave that travels directly 
from the source to the receivers along the surface. The hyperbolic event corresponds to the reflection 
of the wave off the layer interface 
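The sorting from shot gathers into common-midpoint gathers can be illustrated with a small NumPy sketch. The acquisition geometry (three source positions and five receiver positions on a 100 m grid) and all names are hypothetical; the code only applies the midpoint and half-offset relations given above.

```python
import numpy as np


def to_midpoint_offset(x_src, x_rec):
    """Map source/receiver positions to midpoint and half-offset coordinates."""
    x_src = np.asarray(x_src, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return (x_src + x_rec) / 2.0, (x_src - x_rec) / 2.0


# Hypothetical survey: every source fires into the same fixed receiver array.
sources = np.array([0.0, 100.0, 200.0])      # source positions in metres
receivers = np.arange(0.0, 500.0, 100.0)     # receiver positions in metres
src_grid, rec_grid = np.meshgrid(sources, receivers, indexing="ij")
midpoint, half_offset = to_midpoint_offset(src_grid.ravel(), rec_grid.ravel())

# Group traces that share a midpoint: each group is a common-midpoint gather.
cmp_gathers = {}
for m, h in zip(midpoint, half_offset):
    cmp_gathers.setdefault(float(m), []).append(float(h))

# Number of traces per midpoint, i.e. how often each point is illuminated.
fold = {m: len(h) for m, h in cmp_gathers.items()}
print(cmp_gathers[100.0], fold[100.0])
```

Midpoints near the center of the survey collect traces from several shots, which is the data redundancy that the midpoint and half-offset transformation exploits.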

In Fig. 5, a selected group of traces from a more complex subsurface recording 
is presented. Field recordings, depending on their origin, resemble those depicted 
above or are more complex; direct interpretation of subsurface structure is therefore 
ruled out, which creates the need for advanced techniques to transform these 
data into usable models. 

Fig. 4 (left) Raypaths for a seismic common midpoint gather acquired over a flat layer earth. 

Fig. 5 (left) Windows in time and space on a shot gather from a complex subsurface model simulated 
with finite-difference modeling, hence the very high signal-to-noise ratio. (right) Selected traces 
from the shot gather on the left, presented as wiggles 


2.2 Seismic Tomography 


The study of seismic tomography has spanned the past several decades and continues 
to be part of ongoing research [9]. While there exist many ways to formulate this 
reconstruction problem [10-12], it can be expressed most generally as the following 
minimization problem: 

m^{*} = \arg\min_{m} \{ L(f(m), d) \}, (1) 

where m represents the earth model that we desire to recover, d represents the
recorded seismic data, f(m) is a physics-based modeling operator that generates
synthetic data from a prescribed earth model, L is a loss function that measures the
misfit between the recorded data and the simulated data, and $m^{*}$ is the optimal earth
model that minimizes the loss L. While a highly complex m that informs us of many
different earth properties (elastic moduli, density, viscoelastic parameters, etc.) is
generally desired, m commonly represents a three-dimensional acoustic wavespeed
model. This choice of m generally leads to the scalar acoustic wave equation as the
choice for our physics-based simulation f(m). Further simplification can be made
by taking the high-frequency limit of the scalar acoustic wave equation, which results
in the eikonal equation [13]. While the wave equation describes the propagation of
waves and calculates synthetic seismograms (waveforms), the eikonal equation is
based on ray theory and calculates traveltimes. Regardless of the model parameterization
and physics-based forward model used to fit the recorded geophysical data,
the relationship between the data and the desired earth model is nonlinear. Therefore,
a nonlinear optimization algorithm is required in order to minimize the loss
function in Eq. (1). Additionally, because f(m) is in general very computationally
expensive to evaluate, local gradient-based optimization methods must be used
as opposed to global optimization methods. Using only the gradient information of
the loss function can result in convergence to a local minimum and therefore
unsatisfactory solutions. Moreover, because the data in reflection seismic surveys are
recorded at the earth's surface, they do not contain all of the information necessary
to define a velocity model that varies arbitrarily in space. Eq. (1) therefore defines
a nonlinear, ill-posed optimization problem. In using a deep-learning
approach, while we still face this issue of nonlinearity and ill-posedness, we do not
rely on an accurate solution of the wave equation, but rather directly learn a
tomographic operator from many training examples that consist of the seismic data as
feature and the velocity model as label.
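
To make the roles of m, d, f(m) and L in Eq. (1) concrete, the following minimal sketch uses a deliberately simple stand-in for the physics: vertical two-way traveltimes through a stack of flat layers. The layer thickness, the function names and the squared misfit are illustrative assumptions, not the modeling used in this chapter. The last lines illustrate the nonlinearity discussed above: doubling the velocities halves the data instead of doubling them.

```python
# Toy stand-in for f(m): vertical two-way traveltime to the base of each layer.
# Assumption: flat layers of equal thickness; m is a list of layer velocities (m/s).

def forward(velocities, thickness=100.0):
    """Return two-way traveltimes (s) to the bottom of every layer."""
    times, t = [], 0.0
    for v in velocities:
        t += 2.0 * thickness / v
        times.append(t)
    return times


def loss(simulated, recorded):
    """Squared L2 misfit between simulated and recorded data."""
    return sum((s - r) ** 2 for s, r in zip(simulated, recorded))


true_model = [2000.0, 2500.0, 4000.0]
recorded = forward(true_model)            # plays the role of d

trial_model = [2000.0, 3000.0, 4000.0]    # a candidate m
misfit = loss(forward(trial_model), recorded)

# Nonlinearity check: f(2m) is not 2 f(m); for this toy model it is f(m) / 2.
doubled = forward([2.0 * v for v in true_model])
```

The misfit vanishes only for the true model, and any optimizer searching over m has to cope with the fact that f does not scale linearly with m.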


3 Seismic Tomography via Deep Learning 


3.1 Deep Neural Networks for Inverse Imaging Problems 


Seismic tomography is an inverse imaging problem, in which the observation model 
can be represented as: 
$$ d = f(m) + \epsilon, \qquad (2) $$

where d is the observed seismic data, m is the unknown earth model, $f(\cdot)$ is a
mapping operator and $\epsilon$ is noise. While inverse imaging problems can be solved using
analytic models, recent works [14-16] (and references within) argue that
state-of-the-art results for a variety of inverse imaging problems can be obtained using deep
learning methods. Following this line of work, we have proposed a novel approach
[7] that implements the tomography operator using a convolutional neural network
(CNN), whose coefficients are learned in a data-driven approach [17]. The
tomography process is depicted in Fig. 6, and it performs reconstruction of the velocity
model from raw seismic traces, or from features computed from raw seismic traces.
In a real-life application, the ground-truth model is unavailable, and the tomography
operator is designed to minimize the difference between the reconstructed velocity
model and the (unavailable) ground-truth one. The input to the tomography operator
$T(d; \theta)$ is a set of seismic traces (or their features) d, and it is parameterized by a
coefficients vector $\theta$. The tomography operator approximates the inverse mapping
operator $f^{-1}(\cdot)$, and its output is the predicted velocity model $\hat{m}$. In the statistical
learning framework, the tomography operator is learned using a collection of N
training example pairs $\{d_i, m_i\}_{i=1}^{N}$, where the data $d_i$ denotes the set of seismic traces (i.e.
seismic gather) or their features, as generated by wave propagation simulation using
the i-th velocity model $m_i$ (the i-th label). The average misfit between the ground
truth models and their predicted versions, also known as the empirical risk, is defined
by:


$$ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(m_i, \hat{m}_i), \qquad (3) $$


where $L(m_i, \hat{m}_i)$ is the loss function that measures the misfit between the ground
truth velocity model and its prediction $\hat{m}_i = T(d_i; \theta)$. The tomography operator is
learned by minimizing the empirical risk:


$$ \hat{\theta} = \arg\min_{\theta} J(\theta). \qquad (4) $$


The loss function employed in this work is the squared L2-norm of the pixel-wise
difference $m - \hat{m}$, given by $L(m_i, \hat{m}_i) = \| m_i - \hat{m}_i \|_2^2$, which is frequently used in
regression problems, and leads to the following risk minimization problem:


$$ \hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \| m_i - T(d_i; \theta) \|_2^2. \qquad (5) $$


Fig. 6 Tomography reconstruction of velocity models from recorded seismic data

Fig. 7 Convolutional Neural Network (CNN): a a CNN with two convolutional layers and one
fully-connected layer; and b zoom into the first convolutional layer (Source [20])


In addition, regularization [18] of network parameters is optionally applied by an
additional term $R(\theta)$, leading to the following minimization problem:


$$ \hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \| m_i - T(d_i; \theta) \|_2^2 + \lambda R(\theta), \qquad (6) $$


where $\lambda > 0$ controls the weight of the regularization term $R(\theta)$, which is often
defined as Ridge regression $R(\theta) = \|\theta\|_2^2$ or Lasso regression $R(\theta) = \|\theta\|_1$.
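
As a small illustration of Eqs. (3)-(6), the sketch below evaluates the regularized empirical risk for a hypothetical two-coefficient linear operator. The operator, the training pairs and all function names are assumptions made for the example; the chapter's operator is a CNN.

```python
def squared_l2(a, b):
    """Squared L2 norm of the element-wise difference a - b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ridge(theta):
    return sum(t * t for t in theta)       # ||theta||_2^2


def lasso(theta):
    return sum(abs(t) for t in theta)      # ||theta||_1


def regularized_risk(operator, theta, pairs, lam=0.0, penalty=ridge):
    """Eq. (6): mean squared-L2 misfit over the pairs plus lam * R(theta)."""
    data_term = sum(squared_l2(m, operator(d, theta)) for d, m in pairs) / len(pairs)
    return data_term + lam * penalty(theta)


def toy_operator(d, theta):
    """Hypothetical operator: m_hat = theta[0] * d + theta[1], element-wise."""
    return [theta[0] * x + theta[1] for x in d]


# Training pairs (d_i, m_i) generated by m = 2 d + 1.
pairs = [([1.0, 2.0], [3.0, 5.0]), ([0.0, 1.0], [1.0, 3.0])]

risk_at_truth = regularized_risk(toy_operator, [2.0, 1.0], pairs)
risk_with_ridge = regularized_risk(toy_operator, [2.0, 1.0], pairs, lam=0.1)
risk_elsewhere = regularized_risk(toy_operator, [1.0, 1.0], pairs)
```

With lam = 0 the risk reduces to Eq. (5); a positive lam trades data fit against the size of the coefficients.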

The tomography operator is implemented by a CNN, and thus can be represented 
as a hierarchical composition of k non-linear functions, each representing one of the 
k layers of the network: 


$$ T(d; \theta) = g_k(g_{k-1}(\cdots g_2(g_1(d; \theta_1); \theta_2) \cdots; \theta_{k-1}); \theta_k), \qquad (7) $$


where $\theta = [\theta_1, \theta_2, \ldots, \theta_{k-1}, \theta_k]^T$, and each function represents either a
fully-connected (FC) or a convolutional layer [17, 19], as illustrated in Fig. 7.
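
The composition in Eq. (7) can be sketched in a few lines. For brevity the example uses only fully-connected layers with ReLU activations and hand-picked coefficients; these choices and the function names are assumptions for illustration, not the network used in this work.

```python
def relu(x):
    return max(0.0, x)


def dense_layer(inputs, params):
    """One layer g_l(.; theta_l): fully connected with ReLU.

    params is a (weights, biases) tuple; weights holds one row per output unit.
    """
    weights, biases = params
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def apply_network(d, layers, theta):
    """T(d; theta) = g_k(... g_2(g_1(d; theta_1); theta_2) ...; theta_k)."""
    h = d
    for g, theta_l in zip(layers, theta):
        h = g(h, theta_l)
    return h


theta = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # theta_1: two inputs -> two units
    ([[1.0, 1.0]], [-1.0]),                    # theta_2: two units -> one output
]
prediction = apply_network([3.0, 1.0], [dense_layer, dense_layer], theta)
```

Each entry of theta belongs to exactly one layer, which is what allows the gradient of the risk to be propagated layer by layer during training.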


3.2 Velocity Semblance as Input Feature for Deep Networks


Feature extraction is an optional step in our workflow, as it can accelerate the training
of the CNN by providing it with the most relevant data for learning. Our ML platform
is capable of handling diverse network architectures and data, but given the focus
on learning a tomographic operator from the data, we perform what is known as
velocity analysis (velocity being the main attribute of the subsurface model) and use
its output as the input feature space.

To perform velocity analysis, we first transform the data into the midpoint
half-offset coordinates as discussed previously. Then, we apply a time shift to each
offset h of the common-midpoint gather in order to flatten the reflection (which has a
hyperbolic shape) along the offset direction. This time shift is a function of the
half-offset h and the velocity in the medium V and can be calculated via the following
relationship

$$ t^2(h) = t_0^2 + \frac{4h^2}{V^2}, \qquad (8) $$

where t is the traveltime of the hyperbolic event at half-offset h and $t_0$ is the
zero-offset traveltime. Note that Eq. (8) describes the shape of a hyperbola, which is exactly the
shape of the recorded reflection shown in Fig. 4. Performing this time shift requires
that the medium velocity be known a priori (which in the case of VMB it is not).
Therefore, trial velocities are prescribed in order to flatten the reflection event, and
the following coherency measure is then used to measure the flatness of the
time-shifted event:

$$ s[i] = \frac{\sum_{j=i-M}^{i+M} \left( \sum_{k=0}^{N-1} q[j,k] \right)^2}{N \sum_{j=i-M}^{i+M} \sum_{k=0}^{N-1} q[j,k]^2}, \qquad (9) $$


where q[j, k] is the time-shifted common-midpoint gather for a particular velocity
V, and j and k are the time and offset indices, respectively. The inner sum over all N
offsets sums the time-shifted event along the offset direction. Therefore, the flatter
the event (or the closer the prescribed velocity is to the true velocity), the greater the
output of the sum. The outer sum is an average in time over a window of 2M + 1
samples. The output coherency measure s[i] is known as semblance [8] and is often
the first step towards building a velocity model in reflection seismology. Performing
this calculation for multiple midpoints, a semblance cube with axes of time,
velocity and midpoint can be created. The right half of Fig. 8 shows an example of
a semblance cube for the velocity model shown in the left half of Fig. 8. Note that
the semblance cube does not offer very high resolution information about the
velocity model. Rather, it gives an overall trend of how the velocity increases with
depth from midpoint to midpoint.
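
The velocity scan described above can be sketched as follows. The example builds a synthetic common-midpoint gather with a single hyperbolic event, flattens it with a trial velocity using the standard hyperbolic moveout relation for half-offsets, and evaluates the semblance of Eq. (9). The sampling interval, offsets, velocities, nearest-sample lookup and function names are all assumptions chosen for illustration.

```python
import math


def moveout_time(t0, h, v):
    """Hyperbolic traveltime at half-offset h for zero-offset time t0, velocity v."""
    return math.sqrt(t0 * t0 + 4.0 * h * h / (v * v))


def nmo_correct(gather, dt, half_offsets, v):
    """Time-shift every trace so that an event with velocity v becomes flat.

    gather[j][k] is time sample j of the trace at offset index k.
    Nearest-sample lookup is used instead of interpolation to keep this short.
    """
    nt = len(gather)
    q = [[0.0] * len(half_offsets) for _ in range(nt)]
    for j in range(nt):
        for k, h in enumerate(half_offsets):
            src = int(round(moveout_time(j * dt, h, v) / dt))
            if src < nt:
                q[j][k] = gather[src][k]
    return q


def semblance(q, i, M):
    """Eq. (9) at time index i, using a window of 2M + 1 time samples."""
    n_offsets = len(q[0])
    rows = q[max(0, i - M): i + M + 1]
    num = sum(sum(row) ** 2 for row in rows)
    den = n_offsets * sum(x * x for row in rows for x in row)
    return num / den if den > 0.0 else 0.0


# Synthetic gather: one unit spike per trace along a hyperbola.
dt, nt, v_true, t0 = 0.004, 500, 2500.0, 0.8
half_offsets = [50.0 * k for k in range(10)]
gather = [[0.0] * len(half_offsets) for _ in range(nt)]
for k, h in enumerate(half_offsets):
    gather[int(round(moveout_time(t0, h, v_true) / dt))][k] = 1.0

i0 = int(round(t0 / dt))
s_true = semblance(nmo_correct(gather, dt, half_offsets, v_true), i0, M=2)
s_wrong = semblance(nmo_correct(gather, dt, half_offsets, 4000.0), i0, M=2)
```

With the correct trial velocity the event is flat and the semblance approaches one; with a wrong velocity the spikes are spread over several time samples and the semblance drops. Repeating the scan over trial velocities, times and midpoints fills the semblance cube.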
Fig. 8 (left) 2D synthetic earth model, layers of sediments in a simple depositional system, velocity
ranges between 2000 and 4500 m/s. Horizontal coordinate is and vertical represents depth. (right)
Example of a calculated semblance cube for the model on the left. In our case, during training, models
like the one on the left are the labels and semblance cubes are the input data


4 Semblance-Based CNN Results


In this section we first describe the experimental setup, including network archi- 
tecture, datasets, hardware and software and metrics used for quantitative analysis. 
Second, we present the qualitative and quantitative results and discussion. 


4.1 Experimental Setup 


The network architecture is composed of four 3D convolutional layers (64 filters with
kernel size of 6 x 6 x 6) and two fully connected layers. Each layer employs a ReLU
activation function. In addition, max-pooling, batch normalization and dropout with
probability of 0.25 are applied after each convolutional layer, as depicted in Fig. 9.
The loss function is the mean squared error, and the Nesterov Adam [21, 22] optimizer is
used. The network is implemented in Python using TensorFlow [23] and Keras [24]
as supporting DL frameworks.

Fig. 9 Semblance-based 3D CNN architecture: the semblance cube is the input feature to the
network, which includes four 3D convolutional layers and two fully connected layers. Each
convolutional layer is composed of 3D kernels, ReLU activation per kernel, Maxpool, Batch Normalization
and Dropout layer

Fig. 10 (left) The plot shows the metric values across training time. The vertical axis represents the
metric value and the horizontal axis represents time in epoch units, where one epoch is a complete
sweep through the training dataset. (right) A detailed view of the left plot around an area of interest.

The training reaches early stopping at around 250 epochs, after about 6 hours of
running on one high performance computing (HPC) node equipped with four NVIDIA K80
general purpose graphical processing units (GPGPUs) [25] in a data-parallel fashion.
In this parallel execution mode, the model is copied to all computing units and
then the training data are evenly split and distributed among the computing units to
be processed. Inference per model is a matter of seconds, which is extremely
appealing when a large amount of data is to be predicted or multiple velocity scenarios are under
investigation.
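
The data-parallel mode described above can be illustrated with a generic sketch of synchronous gradient averaging. How the framework actually distributes the work is not described in this chapter, so the single-coefficient model, the shard layout and the function names below are assumptions. The point of the example is that, with equally sized shards, averaging the per-replica gradients reproduces the gradient of the full batch.

```python
def mse_gradient(theta, batch):
    """Gradient of the mean squared error of the model y = theta * x over a batch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


def split_evenly(batch, n_workers):
    """Split a batch into n_workers shards (batch size must be divisible)."""
    size = len(batch) // n_workers
    return [batch[w * size:(w + 1) * size] for w in range(n_workers)]


def data_parallel_gradient(theta, batch, n_workers=4):
    """Each replica holds a copy of theta and sees one shard of the batch."""
    shards = split_evenly(batch, n_workers)
    local = [mse_gradient(theta, shard) for shard in shards]   # one per replica
    return sum(local) / n_workers                              # averaged update


batch = [(float(x), 2.0 * x + 0.1 * (x % 3)) for x in range(1, 9)]
full = mse_gradient(1.5, batch)
parallel = data_parallel_gradient(1.5, batch, n_workers=4)
```

Because every replica applies the same averaged gradient, all copies of the model stay identical after each update.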

Two datasets are prepared for both training and testing our model. In the first
dataset, the velocity models only contain layers with velocities that increase with
depth. Additionally, the layers exhibit both undulation and dip (Fig. 11). The second
dataset consists of velocity models similar to those of the first dataset, but a portion
of them have been augmented with salt bodies, while the rest contain no salt. To add
a factor of realism to these models, the shapes of the salt bodies were extracted from
earth models that were the end result of real-life exploration projects in the Gulf of
Mexico. Each dataset consists of 6400 semblance cubes
and the corresponding velocity model labels of size 100 x 100 grid points (the size of
the output layer). For validation and testing purposes, we separated 1600 data/label
pairings.


4.2 Quantitative Metrics


As quantitative metrics for model quality comparison, we use two metrics that are
widely accepted in image-dominated fields: the structural similarity index (SSIM) [26]
and the peak signal-to-noise ratio (PSNR). SSIM differs from traditional objective
metrics in that it is based on structural degradation rather than on error or general
distortion of the images.

From a statistical perspective, the robustness of the model is measured with the $R^2$ score
(coefficient of determination).
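
For reference, the three metrics can be written compactly for flattened velocity models. This is a simplified sketch: the SSIM here is computed over the whole image as a single window with the commonly used stabilizing constants, whereas practical implementations average the index over local windows. The function names and the data_range argument are assumptions for the example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def r2_score(truth, pred):
    """Coefficient of determination."""
    mu = mean(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mu) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def psnr(truth, pred, data_range):
    """Peak signal-to-noise ratio in dB."""
    mse = mean([(t - p) ** 2 for t, p in zip(truth, pred)])
    return float("inf") if mse == 0.0 else 10.0 * math.log10(data_range ** 2 / mse)


def global_ssim(truth, pred, data_range):
    """SSIM evaluated with one window covering the whole image (simplification)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = mean(truth), mean(pred)
    vx = mean([(t - mx) ** 2 for t in truth])
    vy = mean([(p - my) ** 2 for p in pred])
    cov = mean([(t - mx) * (p - my) for t, p in zip(truth, pred)])
    return ((2.0 * mx * my + c1) * (2.0 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


truth = [2000.0, 2500.0, 3000.0, 4500.0]      # flattened velocities in m/s
shifted = [v + 25.0 for v in truth]           # prediction with a constant bias
```

A perfect prediction gives an R2 score and an SSIM of one, while a constant bias lowers the PSNR without changing the structure of the image.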


4.3 Results and Analysis 


The prediction accuracy metrics on the testing set for the first dataset (earth models
only containing layers) are 0.812 for the $R^2$ score and 0.919 for the SSIM. The $R^2$ score for
the test set of the second dataset is 0.741 and the SSIM is 0.892. The convergence
of these metrics with respect to epoch can be observed in Fig. 10. The convergence
curves show the influence of different batch sizes on the performance for both
metrics, although the effect is most noticeable on the $R^2$ metric, which converges
later than the SSIM metric. As expected, the task of predicting a model with salt
bodies is more difficult and therefore the performance is lower for dataset two than
for dataset one. It is difficult to learn the size, shape, and location of these salt bodies
from the input data space. Furthermore, the datasets are relatively small for the task at
hand. The main impediment to obtaining more training data is the computationally
expensive step of generating features via finite difference wave propagation and
calculating the semblance feature.

Fig. 11 (top, left) model 1 of the test dataset, includes salt bodies, which have high velocity and
tend to distort classical modeling. Salt bodies are key in offshore hydrocarbon exploration. (top,
right) prediction, where vertical axis represents depth and horizontal axis represents lateral offset.
(bottom, left) comparison of the velocity profile for x = 400, where the vertical axis represents depth
in meters. (bottom, center) absolute error between the ground-truth and prediction and (bottom,
right) the corresponding error distribution

Fig. 12 Improving model reconstruction (for one model in the testing dataset) as the learning
process sweeps through the training dataset

Qualitatively, the overall performance trend is positive: the salt bodies are mostly
located properly and the surrounding formation resembles the labels in structure and
velocity value (see Fig. 11), which supports the validity of the predictive model.

The main structural elements of the predicted model match the ground truth.
The predicted expression and location of the salt body in Fig. 11 are remarkable. In
particular, the velocity profile shows that the velocity trend is recovered,
with only the sharp interfaces between layers missing. In Fig. 12 we observe how a model
from the validation set is learned as the training of the network progresses (by epochs).
The first prediction (Fig. 12, top row) shows a model with low velocity in a
gradient-based background, with many unresolved samples (blue dots). After a few epochs
(Fig. 12, center row) the predicted model corrects the deeper sections towards higher
velocity and the salt body is clearly reconstructed. Finally, the model prediction is
complete (Fig. 12) and even fine-grained details of the salt body are satisfactorily
resolved.


5 Industry Baseline: Full Waveform Inversion 


5.1 Industry VMB Methods 


Many seismological techniques exist to estimate material properties in the shallow
mantle of the earth. Ray tracing methods rely on the high-frequency ray theory
approximation and picked arrival times of body waves to invert for optimal shear and pressure
velocity models [27, 28]. These techniques are restricted to smooth medium
predictions and fall short when surface wave amplitudes dominate body wave arrivals [29].
Full-waveform inversion (FWI) attempts to overcome these limitations by iteratively
simulating the seismic experiment and updating the earth model until the simulated
seismic data match the recorded seismic data in a least squares sense [30]. By fully
modeling how energy propagates through the subsurface, FWI is more likely than other
methods to find accurate representations of the earth's material properties [31].


5.2 Full Waveform Inversion


FWI uses the entire seismic wavefield recording, that is, all recorded frequencies
and locations, to invert for earth model parameters beneath the surface. The goal of
FWI is to find an earth model that minimizes the distance between the modeled
seismic data, which is a function of the earth model, and the recorded seismic data, which
was gathered in the field. When we have changed the earth model in such a way that
the modeled data very closely resembles the recorded data, we assume we have found
an earth model that is representative of the true earth. Albert Tarantola [30]
was the first to propose solving for earth parameters with such an inverse solution. In
exploration geophysics, FWI is a topic of intense study and is at the forefront of earth
model building from seismic data [32]. That being said, it is plagued with
numerous limitations including high computational cost, extreme sensitivity to the choice
of starting model, and unwanted convergence to incorrect earth model solutions.
Nevertheless, when these limitations are properly accounted for and addressed, FWI is
regarded as an area of development that may bridge the gap between low and high
wavenumber earth model building and represent an all-inclusive solution to seismic
exploration. For this reason we have chosen it as a baseline method against which to
compare the velocity model prediction results of the ML approach defined previously. If ML can
compete with the current cutting-edge industry techniques, it will surely make waves
in the exploration community.

More verbosely, consider the ith shot of a seismic survey, $d_i^{obs}$, where
$i = 1, 2, \ldots, N$. Further, consider some modeled data, $d_i^{mod}$, which is the synthetic
recreation of the ith observed experiment. We can define the distance between the
observed and modeled data as the $L_2$ norm of the difference of the two vectors,


$$ L_2(d_i^{mod}, d_i^{obs}) = \| d_i^{mod} - d_i^{obs} \|_2. \qquad (10) $$


To create the modeled data we use some wave equation operator, $f_i$, which
represents a single seismic experiment. $f_i$ is a function of the earth model, m, and
maps from the earth model space into the data space, $f_i(m) = d_i^{mod}$. In our case m
represents $1/v^2$, the inverse of the squared pressure wave velocity, at each point in
the subsurface. Many wave equation formulations can model how seismic energy
propagates through the earth. Generally speaking, the more complex and accurate
the wave equation, the more computationally expensive the wave modeling becomes.
For our purposes we use the acoustic, constant density, isotropic wave equation [33],


$$ (\Delta - M D_2)\, p = f, \qquad (11) $$


where $\Delta$ is a Laplacian operator, $D_2$ is a second time derivative operator, M is
composed of nt copies of the flattened m, where nt is the total number of time samples
in the seismic recording, p is the pressure wavefield, and f is the injected seismic
source. This wave equation assumes the earth can be represented by a single elastic
parameter, pressure wave velocity, is isotropic, has constant density, and has a zero
shear modulus. We can represent this wave equation operator with a matrix, H(m),
and solve for the wavefield $p_i$:


$$ H(m)\, p_i = f_i, \qquad (12) $$
$$ p_i = H^{-1}(m)\, f_i, \qquad (13) $$


where $p_i$ represents the wavefield resulting from the ith seismic experiment in the
entire domain (here $f_i$ without an argument denotes the source of the ith experiment,
while $f_i(m)$ denotes the modeling operator). We can use an operator, K, to extract the
wavefield at the point receiver locations to arrive at the modeled data, $d_i^{mod}$:


$$ d_i^{mod} = K p_i = K H^{-1}(m)\, f_i = f_i(m). \qquad (14) $$
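
A one-dimensional analogue of Eqs. (11)-(14) can be sketched as follows. The example time-steps the acoustic wave equation with second-order finite differences in both time and space, injects a Ricker source and samples the wavefield at one receiver. The grid, the source delay, the fixed boundaries, the unscaled point source and the function names are all simplifying assumptions; the modeling used in this chapter is two-dimensional.

```python
import math


def ricker(t, peak_freq, delay):
    """Ricker wavelet with the given peak frequency, centered at t = delay."""
    a = (math.pi * peak_freq * (t - delay)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def model_shot(velocity, dx, dt, nt, src_ix, rec_ix, peak_freq=15.0):
    """1D analogue of d_mod = K H^{-1}(m) f for a single experiment."""
    nx = len(velocity)
    m = [1.0 / (v * v) for v in velocity]        # model parameter m = 1 / v^2
    prev, curr = [0.0] * nx, [0.0] * nx
    record = []
    for n in range(nt):
        nxt = [0.0] * nx                         # end points stay fixed at zero
        for i in range(1, nx - 1):
            lap = (curr[i + 1] - 2.0 * curr[i] + curr[i - 1]) / (dx * dx)
            src = ricker(n * dt, peak_freq, 1.0 / peak_freq) if i == src_ix else 0.0
            # (Laplacian - m * d2/dt2) p = f, solved for the next time level
            nxt[i] = 2.0 * curr[i] - prev[i] + dt * dt * (lap - src) / m[i]
        prev, curr = curr, nxt
        record.append(curr[rec_ix])              # K: sample at the receiver
    return record


velocity = [2000.0] * 201                        # homogeneous model, 10 m cells
record = model_shot(velocity, dx=10.0, dt=0.002, nt=400, src_ix=50, rec_ix=150)
```

The time step satisfies the stability condition v dt / dx <= 1. The receiver trace stays at zero until the wave has had time to cross the 100 cells that separate it from the source, which is the traveltime information that tomography exploits.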


Using this wave equation operator we can define a scalar function, J(m), which
sums the squared $L_2$ difference between modeled and observed seismic data over all
experiments:


$$ J(m) = \sum_{i=1}^{N} \| f_i(m) - d_i^{obs} \|_2^2. \qquad (15) $$


Here we have arrived at what is referred to as the FWI objective function. The 
model that reaches the minimum of this objective function is the solution to the FWI 
problem and the model that is our best estimate of the velocity profile beneath the 
surface. 

Solving this inverse problem, that is, finding the model that minimizes J(m),
is notoriously difficult for a variety of reasons. Primarily, the objective function is
nonlinear with respect to m, which means a perturbation in the earth model is not
linearly mapped into the modeled data. It follows that the numerous, well-studied
strategies for solving linear least squares inverse problems do not apply directly. Instead we
must resort to nonlinear regression techniques for which there is no general theory for
finding the optimal model parameters [34]. Iterative methods are a popular choice for
solving nonlinear inverse problems and rely on the gradient of the objective function
at the current model iteration, $m_j$, to update the model parameters and find the next
model iteration, $m_{j+1}$:


$$ m_{j+1} = m_j + \alpha_j s_j. \qquad (16) $$

The next model, $m_{j+1}$, is found by adding to the current model, $m_j$, some search
direction, $s_j$, scaled by a step length, $\alpha_j$. There are many ways to compute the search
direction, $s_j$. We use the nonlinear conjugate gradient method in which:


$$ s_j = s_{j-1} + \beta \nabla J(m_j), \qquad (17) $$


where $s_{j-1}$ is the previous search direction, $\beta$ is the conjugate direction
coefficient, and $\nabla J(m_j)$ is the gradient of the objective function at the current model.
The gradient is obtained by applying $B(m_j)^{*}$, the adjoint of the wave equation operator
linearized around the current model iteration, to the difference between the modeled and
observed data:


$$ \nabla J(m_j) = -\sum_{i=1}^{N} B(m_j)^{*} \left( f_i(m_j) - d_i^{obs} \right). \qquad (18) $$


To put it concisely, at each iteration of FWI we use the gradient of the objective 
function to update the earth model in order to reduce the value of the objective func- 
tion. We stop iterating when the objective function reaches zero or, more realistically, 
once it stops reducing. 
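
The iteration summarized above can be sketched on a toy problem. The example minimizes a least-squares objective of the form of Eq. (15) for a hypothetical two-parameter nonlinear forward model, using the textbook Fletcher-Reeves form of nonlinear conjugate gradient (direction = negative gradient plus beta times the previous direction) with a backtracking line search. The forward model, the line search and the sign and scaling conventions are assumptions of this sketch and may differ from those of Eqs. (17) and (18).

```python
def forward(m):
    """Toy nonlinear forward model standing in for the wave-equation operator."""
    return [m[0] ** 2 + m[1], m[0] * m[1], m[1] ** 2]


def adjoint_jacobian(m, r):
    """Transpose of the linearized forward model applied to a data residual r."""
    return [2.0 * m[0] * r[0] + m[1] * r[1],
            r[0] + m[0] * r[1] + 2.0 * m[1] * r[2]]


def objective(m, d_obs):
    return sum((f - d) ** 2 for f, d in zip(forward(m), d_obs))


def gradient(m, d_obs):
    residual = [f - d for f, d in zip(forward(m), d_obs)]
    return [2.0 * g for g in adjoint_jacobian(m, residual)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nonlinear_cg(m, d_obs, n_iter=100):
    g = gradient(m, d_obs)
    s = [-x for x in g]
    history = [objective(m, d_obs)]
    for _ in range(n_iter):
        if dot(g, g) < 1e-20:
            break
        if dot(s, g) >= 0.0:                     # not a descent direction: restart
            s = [-x for x in g]
        alpha, slope = 1.0, dot(s, g)
        while alpha > 1e-12:                     # backtracking (Armijo) line search
            trial = [mi + alpha * si for mi, si in zip(m, s)]
            if objective(trial, d_obs) <= history[-1] + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        else:
            break                                # no acceptable step length found
        m = trial
        g_new = gradient(m, d_obs)
        beta = dot(g_new, g_new) / dot(g, g)     # Fletcher-Reeves coefficient
        s = [-gn + beta * si for gn, si in zip(g_new, s)]
        g = g_new
        history.append(objective(m, d_obs))
    return m, history


d_obs = forward([2.0, 3.0])                      # data from the "true" model
m_est, history = nonlinear_cg([1.5, 2.5], d_obs)
```

The objective never increases from one accepted iteration to the next, and the loop stops when the gradient vanishes or no step length reduces the objective further, mirroring the stopping rule described above.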

However, the nonlinearity of the FWI objective function means it is not convex
with respect to the earth model. Gradient descent methods, like the one described
above, can fall into a local minimum, that is, find an earth model at which the objective
function stops decreasing but which does not represent the global minimum of the objective
function. In order to avoid local minima, the initial model used in the inversion
scheme, $m_0$, must be fairly close to the true model. Herein lies one of the largest
restrictions of FWI: the gradient-based optimization algorithm converges to the true
solution only if it starts from such a model.

Many methods exist and extensive research continues to find ways to avoid these 
convergence issues. A highly effective and widely accepted method is that of [35] 
which is referred to as multiscale FWI. This technique decomposes the FWI problem 
by scale and performs conventional FWI with progressively higher bandpasses of 
the source wavelet and observed data. 
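
Both the local-minimum problem and the multiscale remedy can be reproduced with a one-parameter analogue. In the sketch below the unknown is the delay of a Ricker wavelet, the objective is the squared misfit between a shifted wavelet and the observed one, and the optimizer is a simple fixed-step local descent. The wavelet, the frequencies, the step size and the function names are assumptions for illustration; this is a toy time-shift problem, not FWI itself.

```python
import math


def ricker(t, peak_freq):
    a = (math.pi * peak_freq * t) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


TIMES = [0.002 * n for n in range(501)]          # 0 to 1 s
TRUE_DELAY = 0.5


def make_objective(peak_freq):
    """Misfit between a trial-delay wavelet and the observed one at one scale."""
    observed = [ricker(t - TRUE_DELAY, peak_freq) for t in TIMES]

    def objective(delay):
        return sum((ricker(t - delay, peak_freq) - d) ** 2
                   for t, d in zip(TIMES, observed))
    return objective


def local_descent(objective, delay, step=0.0005, max_iter=5000):
    """Move downhill in fixed steps until neither neighbour improves the misfit."""
    for _ in range(max_iter):
        here = objective(delay)
        left, right = objective(delay - step), objective(delay + step)
        if left < here and left <= right:
            delay -= step
        elif right < here:
            delay += step
        else:
            break
    return delay


def multiscale(start, peak_freqs=(3.0, 15.0)):
    """Invert the low-frequency problem first and reuse its result as the start."""
    delay = start
    for f in peak_freqs:
        delay = local_descent(make_objective(f), delay)
    return delay


stuck = local_descent(make_objective(15.0), 0.58)    # poor start, high frequency only
near = local_descent(make_objective(15.0), 0.52)     # good start, high frequency only
recovered = multiscale(0.58)                         # poor start, low to high frequency
```

Starting 80 ms away from the true delay, the high-frequency misfit traps the descent in a side lobe (cycle skipping), while the same start succeeds once the low-frequency problem is solved first. This is the behaviour that the multiscale scheme exploits.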


5.3 Baseline Comparison Setup 


To compare the velocity model predictions of the proposed CNN-based approach
and FWI, four synthetic seismic surveys are created and used as inputs. The intent
is to keep the input data consistent in order to create a fair comparison between the
methods, as in [36]. Below we describe the data generation, the parameters of the three
experiments, and the quantitative methods used to compare model results.

The synthetic models used to generate the seismic data are 1.8 km in the x direction
and 1.4 km in the z direction with a grid cell discretization of 10 m. The model
parameter used is pressure wave velocity, which increases with depth and includes salt
bodies of varying shape and size. The velocities range from 2.0 to 4.5 km/s. All four of
the models have layer cake backgrounds which are representative of the upper crust
of the earth. Three of the four compared models contain high velocity zones, around
4500 m/s, which are characteristic of larger salt diapirs that often trap migrating oil
and gas. It follows that the industry places great interest on finding and resolving
these salt bodies.

Fig. 13 Ground-truth velocity models


The data are generated from 19 shots at the surface with 40 m spacing in the x
direction, beginning at 520 m. The shot wavelet is a Ricker wavelet with a 15 Hz peak
frequency. 144 receivers located at the surface record pressure data. They begin at
180 m in x with 10 m spacing. The wave propagation modeling assumes an acoustic,
constant density earth and uses a second-order approximation in time and an
eighth-order approximation in space. Figure 13 illustrates
the four models used to compare the methods. Note that the data were generated on
the 1.8 x 1.4 km models, but the velocity predictions were made on a 1.0 x 1.0 km subset
of the original models.
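
The acquisition parameters listed above translate directly into a small setup script. The variable names and the Ricker wavelet helper are assumptions for illustration; the numbers are those given in the text.

```python
import math

DX = 10.0                                         # grid cell size in metres
nx, nz = int(1800 / DX), int(1400 / DX)           # cells of the 1.8 x 1.4 km model
subset = int(1000 / DX)                           # cells per side of the 1.0 km subset

shot_x = [520.0 + 40.0 * i for i in range(19)]        # 19 shots, 40 m apart
receiver_x = [180.0 + 10.0 * i for i in range(144)]   # 144 receivers, 10 m apart


def ricker(t, peak_freq=15.0):
    """Zero-phase Ricker wavelet with the given peak frequency (Hz)."""
    a = (math.pi * peak_freq * t) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


geometry_summary = {
    "grid": (nx, nz),
    "shots": (shot_x[0], shot_x[-1]),
    "receivers": (receiver_x[0], receiver_x[-1]),
}
```

The shots span 520 m to 1240 m and the receivers 180 m to 1610 m, so the whole spread fits inside the 1.8 km model, and the 1.0 km subset corresponds to the 100 x 100 grid points of the network output.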


5.4 Experiments 


The first experiment is conducted with conventional FWI; 1000 iterations of
nonlinear conjugate gradient are performed using all frequencies of all modeled shots.
The starting model is a linear velocity gradient from 2.0 to 4.5 km/s. A variation
of this experiment is also conducted in which 200 conjugate gradient iterations are
performed using the predicted model from the CNN as the starting model for FWI.
The second experiment is multiscale FWI, which performs 150 conjugate gradient
iterations over 5 bandpasses of all the modeled shots. The first 4 bandpasses of the
data are smoothly tapered at 4, 8, 16, and 32 Hz. The fifth inversion uses all
frequencies. The starting model for the 4 Hz inversion is a linear velocity gradient from
2.0 to 4.5 km/s. Each progressively higher bandpass inversion uses the final model
from the previous bandpass inversion as its starting model. The third experiment's
results are obtained by exposing the trained neural network to unseen data, in our case,
to unseen semblance cubes from velocity models created by our pseudo-random velocity
model generator.


5.5 Results 


We perform the comparative analysis on four seismic datasets generated from the
velocity models in Fig. 13. The comparison is limited to four datasets because of
the high computational cost of FWI. In fact, retrieving one multiscale FWI result
takes more time than training the CNN used for the ML approach. After the upfront
cost of creating the trained CNN, a single model prediction can be made almost
instantaneously. This illustrates the low computational cost of ML compared to FWI.

Figures 14, 15, and 16 depict the results of the three VMB methods on the four
models both visually and numerically. In Figs. 14 and 15, rows correspond to
models and columns to VMB methods. Since salt diapirs are of great interest in the
oil and gas community, comparisons are also made over windowed portions of the
earth models that contain such bodies. Figure 16 gives a more in-depth look into the
results on model 0 by computing difference plots between the true model and each
of the VMB method results. This gives intuition on where each method is over- or
under-performing relative to the others. It also shows the error histograms to illustrate
the distribution of velocity errors.


5.6 Discussion 


A large impact would be made in the exploration seismic community if a method
emerged that could construct earth models more effectively than FWI. We claim to
have found such a method, one that leverages ML and shows promising results on
synthetic experiments. Our approach succeeds where FWI fails, in that ML is more robust,
free of human bias, and computationally cheaper.

To back up this claim, we analyze the experimental results visually and
numerically in Figs. 14, 15, and 16. When comparing the results of the three
approaches, we observe that both the CNN prediction and multiscale FWI were able
to recover the original velocity model with good accuracy, while the conventional
FWI approach fell into a local minimum and was not able to recover a reasonable
velocity solution. For example, examine the full view and the zoom view results of
model 3, plotted in rows two and three of Fig. 15. The CNN and multiscale FWI
methods both resolve the correct salt body location while conventional FWI completely
misjudges the depth and shape of the body. Furthermore, the CNN better predicts
the complex outline of the salt, including the bottom side, which is notoriously
difficult in real-world applications. In general we find that the output of the DNN is
smoother than the velocity estimated via multiscale FWI. This is highlighted in the
difference plots of Fig. 16. One can see that the multiscale FWI approach performs
better at resolving the interfaces between the layers. This is likely due to the fact that
a smoothing occurs when calculating the input semblance cubes, which limits the
maximum frequency in the semblance cube. Multiscale FWI, however, attempts to
match modeled data to recorded data that may have a broader range of frequencies.
Figures 14 and 15 do show that, according to the SSIM metric, multiscale FWI
generally outperforms our ML results. But we find visually that the resolution of the
salt bodies from the ML approach is more impressive and appealing from an oil and
gas interpretation point of view.

Fig. 14 Velocity Model Building results comparison: (1st row) Model 0, (2nd row) zoom into the
salt body in Model 0, (3rd row) Model 1; and (4th row) zoom into the salt body in Model 1

Fig. 15 Velocity Model Building results comparison: (1st row) Model 2; and (2nd row) Model
3, (3rd row) zoom into the salt body in Model 0

Even though the results of DL are preliminary, they are already competitive with
FWI. Consider the biases present in each approach. In order for FWI to succeed,
we needed to use the multiscale scheme. There are dozens of other regularization
methods that may or may not work depending on the specific experiment. It is left
up to the geophysicist to decide. Furthermore, the sensitivity to the starting model
means a priori information on the structure of the earth must be known. In
real-world scenarios, the starting model used in these experiments, the linearly increasing
velocity profile with depth, will not suffice. A fairly detailed starting model must be
constructed by the geophysicist beforehand. In contrast, the ML approach did not
need any handpicked regularization of the input data and it requires no starting
model. ML retrieved competitive results without any human bias. Furthermore, FWI
has been in development for 20 years, whereas our ML method is still in its infancy.
If we can recover sharper velocity model results with ML, and thus beat the FWI
results, there is nothing stopping ML from replacing FWI.

Beyond comparing the velocity model results, we must address an equally important
aspect, computational efficiency. The ML and FWI results were computed at different
high performance computer cluster facilities, making a direct computational comparison
difficult. However, examining precise clock cycles is not necessary because, by rough
estimation, ML is orders of magnitude more efficient. Consider that performing 1000
iterations of nonlinear conjugate gradient to recover the multiscale FWI results took
about two days on a busy Stanford University computer cluster. Of course, the
modeling and inversion codes used were for academic purposes and were therefore
not fully optimized. But the earth models used are also fairly small, 1 km x 1 km,
by industry standards. If more efficient code were used on larger models, the compute
time would most likely remain on the same order of magnitude: days. Now, consider
that training the CNN model to map from semblance cubes to velocity models took
about a day to finish. If larger models are used this may increase. Regardless, one
may conclude that both methods are about equally efficient, as both take on the order
of days to finish. But herein lies a critical difference between the two approaches:
the CNN model is reusable. Once the training is finished, mapping from a single
new dataset to a velocity model is nearly instant, whereas mapping a new dataset
using multiscale FWI would take days to finish. The cost of the ML approach is all


[Fig. 16 panels: ground-truth (GT) models and predictions plotted over x [meters] and z [meters], with error histograms of velocity error [%] versus frequency]


Fig. 16 Comparison of tomography results from the DL and FWI for model 0. Leftmost column 
shows ground-truth (label), second from left shows the prediction from the DL (top), the multiscale 
(MS) FWI result (middle) and the standard FWI result (bottom). Third column from left shows the 
difference between the ground truth and the prediction as a percentage of the velocity error. The last 
column shows the percentage of velocity errors for each sample binned and plotted in a histogram 
form. When comparing the prediction of the DNN to the MS FWI result, we observe that the DNN 
has difficulty in resolving sharp interfaces. Also note that a MS FWI approach was necessary to 
avoid cycle skipping that is apparent with the conventional FWI result 


upfront, and the trained model can then be reused any number of times to make
near-instant predictions. Nothing about FWI is reused, and each application is
equally expensive.
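
The amortization argument can be illustrated with a rough cost model. The sketch below is only a back-of-the-envelope estimate: the two-day and one-day figures are the approximate timings quoted above, while the per-dataset inference time of a few seconds and all function names are assumptions made for illustration.

```python
# Rough, order-of-magnitude cost model for the comparison above.
# The timings are approximate figures from the text or explicit assumptions.
FWI_DAYS_PER_DATASET = 2.0     # about two days per multiscale FWI inversion
CNN_TRAINING_DAYS = 1.0        # about one day of one-off training
CNN_INFERENCE_SECONDS = 5.0    # hypothetical value for "nearly instant" inference

SECONDS_PER_DAY = 86_400


def fwi_total_days(n_datasets: int) -> float:
    """Every new dataset pays the full inversion cost again."""
    return FWI_DAYS_PER_DATASET * n_datasets


def cnn_total_days(n_datasets: int) -> float:
    """Training is paid once; each dataset only adds inference time."""
    return CNN_TRAINING_DAYS + n_datasets * CNN_INFERENCE_SECONDS / SECONDS_PER_DAY


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n:>3} datasets: FWI ~{fwi_total_days(n):.1f} days, "
              f"CNN ~{cnn_total_days(n):.3f} days")
```

Under these assumptions the FWI cost grows linearly with the number of datasets, while the CNN cost stays close to the one-off training time.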

We show a new way of doing tomography with ML that leaves human biases and
recurring high computational cost behind. While the ML results are competitive,
they are still beaten by a regularized FWI method. However, our ML method is still in
its infancy, whereas FWI has been in development for over 20 years. Further progress
may yield an ML method that can outperform FWI on all fronts, including model quality.
A synergistic approach that utilizes both techniques is also an interesting and more
realistic proposition. Using the unbiased results of ML as a starting model, FWI
could fill in the remaining sharp contrasts with fewer iterations. This could
quickly produce high-quality models completely devoid of human bias. The broader
case we make here is for the revolution of workflows in industry exploration. We see
potential for many intermediate steps to be absorbed by ML-driven approaches, and
seismic tomography is a stepping stone towards that.
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6 Feature Extraction-Free Results 


Human-biased feature extraction is not desired when truly following the deep learning
paradigm, which encourages an end-to-end learning process that maps from the
relevant elements of the raw data to the ground truth. After the initial success of the
semblance-based approach, experiments were conducted with a modified version of
the network (as depicted in Fig. 9) that accepts seismic gathers without manipulation
as inputs (Fig. 17, right). The label dataset (Fig. 17, left) is composed of the velocity
models as described in previous sections. The main change in the network, compared
to the one presented in Sect. 4, is in the input layer. The input is now the raw seismic
shot gathers, which have the dimensions (number of shots × number of receivers ×
time samples). Each data/label pair is composed of the 3D seismic gather just described
and the corresponding velocity model as label. Furthermore, this network's
training used the Nesterov optimizer ADAM with a learning rate of 1e-03 and a batch
size of 20 (per GPGPU), and the experiments were executed for 250 epochs using
the MSE loss function. Training takes less than two hours and inference takes only
seconds.

It can be observed in Fig. 18 that the velocity model used as label is larger than the
ones used in Sect. 4 and much richer in features (velocity variation). This is because
these velocity models belong to datasets used in exploration in the Gulf of Mexico
(for confidentiality reasons, actual geographical locations cannot be shared).
Consequently, the generated seismic data is much closer to what actual field records
look like. The decision to use this data is not random: the final purpose is to expose
the ML approach to real field data and therefore cross the threshold from research
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Fig. 17 (left) Example velocity model from the training dataset, (right) examples of corresponding 
seismic traces (only three selected sources), obtained by wave propagation and without first arrival 
removal, where vertical axis represents time and horizontal is offset from source location 
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Fig. 18 Results examples for a trained model without pre-computed features. (top, left) shows the 
ground-truth and (top, right) the corresponding prediction. (bottom, left) comparison of the velocity 
profile for x = 500, vertical axis represents depth in meters 


into an industry-tested tool. One step in data preparation is to downsample the label
and data; in particular, the data was downsampled to fit in GPGPU memory, which for
these experiments is provided by three NVIDIA V100s with 16 GB of internal memory each.
The input data dimension used is 31 × 256 × 300 (the dimensions are those described in
the previous paragraph). The label and predicted model dimension is 100 × 100 samples,
where the first dimension represents the horizontal axis and the second dimension
represents depth (the vertical axis). The total size of the training dataset is 960
samples, of which 80 samples were set aside for validation and another 80 samples
for testing.
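
The sizes quoted above can be checked with simple arithmetic. The following sketch assumes single-precision (4-byte) storage and reads the 960 samples as the full dataset from which the validation and test samples are drawn; neither assumption is stated explicitly in the text.

```python
# Input gather dimensions quoted in the text: shots x receivers x time samples.
N_SHOTS, N_RECEIVERS, N_TIME = 31, 256, 300
BYTES_PER_VALUE = 4        # assumption: float32 storage
BATCH_SIZE = 20            # per-GPGPU batch size quoted in the text

TOTAL_SAMPLES = 960
VALIDATION_SAMPLES = 80
TEST_SAMPLES = 80

values_per_gather = N_SHOTS * N_RECEIVERS * N_TIME
bytes_per_gather = values_per_gather * BYTES_PER_VALUE
bytes_per_batch = bytes_per_gather * BATCH_SIZE

# Assumption: validation and test samples are taken out of the 960 samples.
training_samples = TOTAL_SAMPLES - VALIDATION_SAMPLES - TEST_SAMPLES

if __name__ == "__main__":
    print(f"values per gather : {values_per_gather}")
    print(f"MiB per gather    : {bytes_per_gather / 2**20:.2f}")
    print(f"MiB per batch     : {bytes_per_batch / 2**20:.2f}")
    print(f"training samples  : {training_samples}")
```

Even after downsampling, a single batch of raw gathers occupies on the order of a couple of hundred megabytes before any network activations are counted, which is consistent with memory being the stated reason for downsampling.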

The results of training with no pre-computed features (Fig. 18) are at least comparable,
if not superior, to the ones presented in Sect. 4. They are comparable in the sense that
all major features of the expected reconstructed models are present and the error ranges
are similar. In quantitative terms, the test dataset SSIM metric is 0.8181 and the R²
metric is 0.8272. These two figures are slightly less impressive than the ones reported
in Sect. 4; three main factors are responsible: complex velocity models, a smaller
dataset and forced downsampling of the data. In qualitative terms, the largest errors
appear around the fine-grained contours of salt bodies, which is also the case for the traditional
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techniques as is shown in Sect. 5. Nonetheless, the results are superior in the sense that 
the expected reconstructed models are more detailed and therefore harder to predict.
Moreover, these velocity models are essentially what is used in exploration geophysics
within seismic imaging workflows.
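
For readers unfamiliar with the R² metric, a minimal sketch of the standard coefficient of determination is given below. It assumes the usual definition, R² = 1 − SS_res/SS_tot, computed over flattened velocity values; the exact variant used for the reported figure is not specified here, and the function name is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far predictions are from the reference.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the reference around its mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("reference values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1, and a prediction no better than the mean of the reference values gives 0.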


7 Conclusions 


This chapter presents a novel DL approach to a key geoscience problem [5, 37]. By
utilizing DL, it is possible to predict earth models directly from the recorded seismic
data. Essentially, we are replacing a nonlinear inverse problem with a data-driven
learning process. Results with synthetic data achieve high visual accuracy, as measured
by both the structural similarity image metric (SSIM) and PSNR. This solution enables
fast turnaround of exploration workflows that nowadays take weeks to complete,
thereby allowing domain experts to focus on the most complex prospects within the
data. The proposed approach can be extended to other relevant geoscience problems
where accurate earth models are also required. Future work is twofold: extension to
3D tomography, namely reconstruction of three-dimensional subsurface models, and
validation with field-recorded seismic data.
Traffic Light and Vehicle Signal
Recognition with High Dynamic Range
Imaging and Deep Learning 


Jian-Gang Wang and Lu-Bing Zhou 


Abstract Use of autonomous vehicles aims to eventually reduce the number of 
motor vehicle fatalities caused by humans. Deep learning plays an important role 
in making this possible because it can leverage the huge amount of training data 
that comes from autonomous car sensors. Automatic recognition of traffic lights and
vehicle signals is a perception module critical to autonomous vehicles because a deadly
car accident could happen if a vehicle fails to follow traffic lights or vehicle signals.
A practical Traffic Light Recognition (TLR) or Vehicle Signal Recognition (VSR) system
faces several challenges, including varying illumination conditions, false positives and
long computation time. In this chapter, we propose a novel approach to recognize
Traffic Light (TL) and Vehicle Signal (VS) with high dynamic range imaging and 
deep learning in real-time. Different from existing approaches which use only bright 
images, we use both high exposure/bright and low exposure/dark images provided 
by a high dynamic range camera. TL candidates can be detected robustly from low 
exposure/dark frames because they have a clean dark background. The TL candidates 
on the consecutive high exposure/bright frames are then classified accurately using a 
convolutional neural network. The dual-channel mechanism can achieve promising 
results because it uses undistorted color and shape information of low exposure/dark 
frames as well as rich texture of high exposure/bright frames. Furthermore, the TLR 
performance is boosted by incorporating a temporal trajectory tracking method. To 
speed up the process, a region of interest is generated to reduce the search regions 
for the TL candidates. The experimental results on a large dual-channel database 
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have shown that our dual-channel approach outperforms the state of the art which 
uses only bright images. Encouraged by the promising performance of the TLR, 
we extend the dual-channel approach to vehicle signal recognition. The algorithm 
reported in this chapter has been integrated into our autonomous vehicle via the Data
Distribution Service (DDS) and works robustly on real roads.


Keywords Traffic light recognition · Vehicle signal recognition · High dynamic
range imaging · Deep learning · Autonomous vehicle · Data distribution service


1 Introduction 


Traffic Light Recognition (TLR) locates the traffic light in an image and then
estimates the status of the light signal. Vehicle Signal Recognition (VSR) estimates the
signal of the vehicles ahead from an image. Automatic recognition of traffic lights and
vehicle signals are two perception functionalities needed by Advanced Driver Assistance
Systems (ADAS) or Autonomous Vehicles (AV), because failing to follow a traffic
light or vehicle signal could lead to a fatal accident. There have been many studies
on TLR. However, not much attention has been paid to practical TLR problems. The
most challenging issues of TLR include computation time, day/night lighting conditions,
confusion with tail lights or other kinds of ambient light, low image resolution and
vehicle occlusion. Most existing TLR approaches are sensitive to lighting conditions
because only bright images are used. In this chapter, under the premise of ensuring
real-time performance, we are interested in the TLR problem under varying lighting
conditions and confusion with tail lights. A two-stage approach is proposed to solve
these problems: detect traffic light candidates in the low exposure/dark image, which is
less sensitive to lighting conditions, and then recognize their traffic light state in the
high exposure/bright image, which has rich texture. Deep learning is adopted to improve
the recognition accuracy significantly.

Some good surveys on TLR can be found in [1-3]. The existing TLR methods 
can be roughly divided into three categories: (1) template matching; (2) circular 
extraction; and (3) color distribution. In the first category, templates of red or green 
light are matched with the extracted regions. The circular shape is detected from 
images by using Hough transform in the second category. The third category is 
mainly color segmentation. One of the major disadvantages of the approaches in these
three categories is their high sensitivity to lighting conditions. Color and shape information
are used to detect TL candidates [4-7]. Some image preprocessing is applied to prune 
the candidates before being fed to the classifier. Before pruning candidates using 
shape, temporal, edge and symmetry, image segmentation in HSV [8] or RGB [9] 
space is adopted. In order to recognize the traffic light states robustly, an adaptive 
template matching method is proposed [10]. 

In order to improve detection accuracy, a region of interest (ROI) is used to reduce
the search region. An annotated map and GPS are used to generate the ROI [11, 12]. Some
non-passive approaches, e.g. vehicle-to-light or car-to-car [13-15], have also been
proposed. However, they are not widely adopted because special infrastructure is required.

High Dynamic Range (HDR) imaging is able to reproduce images with a greater
range of illuminance than traditional imaging technology. This can be done by
capturing and then combining several different, narrower-range exposures of the same
subject matter [16]. Traditional cameras capture images with a limited exposure
range, referred to as low dynamic range (LDR). Compared to HDR, LDR results in the
loss of detail in highlights or shadows.

We use HDR in a different way than existing HDR research. Instead of generating
HDR images from multiple exposure images, we use images at different illumination
levels independently. A dual-channel method is proposed to detect and
recognize traffic lights and vehicle signals. The low and high exposure images are
used to detect lights and recognize light status, respectively.

We note that the detection of lights from a low exposure/dark image is much
more robust than that from a high exposure/bright image (which is sensitive to
environment illumination). In this chapter, an HDR camera, which can provide more than
one image with different exposures, is used. As far as we know, this is the first time
an HDR camera is used in traffic light or vehicle signal recognition. The authors of [17]
used two channels by fusing simple color thresholding segmentation and SVM [18]
(with Histogram of Oriented Gradients features). They conducted some experiments on
urban scene images but did not report accuracies in their paper. As is well known,
thresholding-based color segmentation is very sensitive to outdoor illumination.
Compared to deep learning features [19], hand-crafted features, like HOG, cannot be
expected to yield higher accuracy even if more training samples are provided. In this
chapter, we extensively investigate HDR imaging and deep learning with applications
in TLR and VSR. The TL candidate detection is executed on a low exposure/dark
image, which is fast and robust to the environment illumination. Once TL candidates
are found, the TL states can be identified by applying a machine learning algorithm,
e.g. Adaboost [4]. Deep learning is a state-of-the-art machine learning method.
The advantage of deep learning over traditional technologies is that the accuracy does
not saturate with a growing number of training samples. It has been found that the
accuracy of traditional machine learning, e.g. SVM, will not improve even when
more training samples are added. In addition, deep learning has been applied widely as
the progress of computer hardware makes it feasible for real-time applications. In this
chapter, given a TL candidate region, its counterpart in the bright image is passed to a
Convolutional Neural Network (CNN) to identify the traffic light state. Furthermore, the
TLR accuracy is improved by incorporating a tracking technology.

The traffic light and vehicle signal lights recognition [4, 20, 21] is essentially a 
computer vision problem. Machine learning could be enhanced by some preprocess- 
ing or post processing, e.g. image processing or geometric estimation. As the normal 
road is flat, the number of the TL candidates could be reduced by some geometric 
properties, e.g. the perspective projection between the camera and the road. This can 
save computational cost significantly because each candidate needs to be classified. 

Figure 1 shows the diagram of our traffic light recognition system. The HDR camera
makes it possible to set different parameters for different channels to cover dynamic 
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Fig. 1 Diagram of the traffic light recognition system [22] 


range. To speed up the recognition processing, the candidates are pruned by saliency 
map and region of interest (ROI). The ROI in our approach is determined by: (1) 
calibrating the camera with respect to the ground world coordinate; (2) the knowledge 
about the physical heights of the TLs. Finally, based on temporal trajectory analysis, 
we develop a tracking technology. In doing so, the robustness and accuracy of the 
TLR have been improved. 

This chapter is a technical summary of three papers [20, 22, 23], which respectively
use related technologies for TLR and VSR. Readers can refer to these papers for the
original methodologies.

The remainder of the chapter is organized as follows. The HDR imaging based 
traffic light detection will be discussed in Sect. 2. The CNN traffic light recognition is 
discussed in Sect. 3. In Sect. 4, tracking technology is discussed. The experimental 
results are given in Sect. 5. The extension of the dual-channel method to vehicle 
signal recognition is discussed in Sect. 6. Conclusion and future work are discussed 
in Sect. 7. 


2 HDR Imaging Traffic Light Detection 


2.1 HDR Imaging 


High Dynamic Range (HDR) imaging is the compositing and tone-mapping of 
images to extend the dynamic range beyond the native capability of the capturing 
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device [16]. HDR technology has been successfully applied in both photo and TV 
to make images/frames have a greater contrast between bright and dark. Different 
from the existing applications of HDR which enhance visual quality by combining 
bright and dark images, we use dual-channel separately and combine them using the 
association between the two channels. 

Our motivation for adopting HDR is the higher detection ability of the dark
images and the higher recognition ability of the bright images. Although the dark and
bright images are not captured at exactly the same time, the time difference
between them is short enough to be neglected. In other words, we can easily find the
corresponding regions between bright and successive dark images, and vice versa.
This helps us associate the detected traffic light candidates in the dark image with
their location in the bright image.

As mentioned, the dark images are used for traffic light candidate detection and 
bright images are used for recognition. As for vehicle signal recognition, the bright 
images are used for vehicle detection and dark images are used for vehicle signal 
recognition. 


2.2 Dark Images for Detecting Traffic Light Candidates 


Detecting TL candidates from images is an essential step for a successful traffic light
state classification/tracking system. Currently, most TLR systems detect traffic
lights using only bright images. However, similar to other detection problems, their
performance is very sensitive to the environment lighting conditions and to confusion
with the tail lights of the vehicle ahead or other similar ambient light sources, for
example, traffic signs, temporary roadblocks and pedestrians. How to robustly detect
traffic light candidates under varying illumination is still an open problem. In this
chapter, instead of using a single image, we propose a novel method that uses the dual
channels (low and high exposure, respectively) provided by an HDR camera. Unlike
previous HDR imaging, in which a single image is synthesized from bright and dark
channels, in our approach these channels are used separately to detect and recognize
traffic lights. As mentioned in Sect. 2.1, an HDR camera, with a wider dynamic range
than a normal camera, can be used in such a way that two successive channels are set to
high exposure and low exposure, respectively. The traffic light candidates detected in a
dark image can be located easily in the bright channel as they are captured within a very
short time interval, about 40 ms for our camera running at 25 fps with a high-definition
serial digital interface (HD-SDI). The association between the candidate regions on dark
and bright images is not greatly affected by high-speed motion. Nevertheless, a way to
re-locate the TL candidate detection results on the bright image is proposed in this
chapter and will be discussed in Sect. 3.1.
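
The 40 ms figure follows directly from the frame rate. As a rough illustration, the sketch below also estimates how far the host vehicle travels between a dark frame and the following bright frame; the vehicle speeds are hypothetical values chosen for illustration, not measurements from the experiments.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / fps


def distance_travelled_m(speed_kmh: float, fps: float) -> float:
    """Distance covered by the vehicle during one frame interval, in meters."""
    speed_ms = speed_kmh * 1000.0 / 3600.0
    return speed_ms / fps


if __name__ == "__main__":
    fps = 25.0  # camera frame rate quoted in the text
    print(f"frame interval: {frame_interval_ms(fps):.0f} ms")
    for speed in (30.0, 60.0, 90.0):  # hypothetical speeds in km/h
        print(f"{speed:>4.0f} km/h -> {distance_travelled_m(speed, fps):.2f} m per frame")
```

Under these assumptions the vehicle moves by about a meter or less between paired frames, which is small relative to the typical distance to a traffic light.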

The way we use HDR imaging makes the traffic light candidate detection
more robust than other approaches because the lights appear against a clean dark
background in low exposure images. By using this HDR dual-channel mechanism, undistorted color and
Fig. 2 Traffic light detection and recognition with dual-channel mechanism [22]. a High
exposure/bright image; b Low exposure/dark image; c Dark image with saliency map of the ROI;
d Traffic light candidate detection and recognition results. The traffic light state result is
displayed in the upper right


shape information on a dark image and rich context information on a bright image 
are fully used. 

Figure 2 shows an example of the dual-channel TLR. We can see that the lights, 
including traffic lights and vehicles’ tail lights, are prominent in the dark image. The 
rich context can be seen from the bright image. 

Low lighting conditions are a challenging issue in using HDR to recognize traffic
lights. In [17], traffic light candidates are detected from the dark image by a simple
color thresholding segmentation. The detection performance could be unreliable, as it is
hard to adapt to varying illumination conditions with a fixed threshold. Saliency map
filtering, which aims to simplify and/or change the representation of an image into
something that is more meaningful and easier to analyze [24], is adopted in this chapter
to handle the low lighting problem.
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2.3 Saliency Map Filtering 


Most of the existing traffic light recognition methods detect traffic lights (color blobs) 
by tuning thresholds. The color information is used for locating and identifying traffic 
light states. YCbCr [25], instead of RGB, is considered for this purpose because the 
color and intensity are mixed in three channels of the RGB color space. 

Usually, the parameters used to identify the traffic light states (red, green and
amber) are very sensitive to environment lighting conditions. As verification needs
to be done for each pixel in order to determine the state, the time consumption
increases linearly with the number of colors. In order to speed up the process, a
non-parametric model is proposed in this chapter to extract blobs of various colors
simultaneously. The RGB color space is used in this chapter in order to illustrate the
robustness of our method, although the performance could be better when HSV or
other spaces are adopted [17].

Our method contains the following steps. Firstly, the 3D RGB color space is divided
into a grid of M × M × M cells. M is set to 32 (without fine tuning) in this chapter.
Secondly, the histograms for each state, including red, green and amber colors, are
calculated from samples. Let us denote the normalized histograms, with values in [0, 1],
for red, green and amber as $H_r$, $H_g$ and $H_a$, respectively. Values above 0.1 in
$H_r$, $H_g$ and $H_a$ are truncated to prevent extreme dominance of a single color bin.
The resulting histograms are renormalized to [0, 1]. Given an input image, I, the saliency
score of a pixel (i, j) in the red channel is computed as


$$S_r(i, j) = \sum_{(i', j') \in N_d(i, j)} H_r(i', j') \tag{1}$$


where $N_d(i, j)$ represents the neighborhood of pixel (i, j) within a maximal distance of d,
and $H_r(i', j')$ denotes the value of $H_r$ at the color bin of pixel (i', j') in I.
A saliency mask can be obtained by applying a threshold, T, to $S_r$. In this chapter, d
is set to 2 and T is set to 0.2. No fine tuning is done for these settings. Although
saliency maps can be computed individually with the histogram models for different
light types, it is computationally redundant to compute the saliency score of each
pixel for each color. In our approach, a Max operator is proposed to combine the
histograms of the three traffic light states (red, green and amber):


$$H = \max(H_r, H_g, H_a) \tag{2}$$


A final saliency map, S, is obtained by replacing $H_r$ in Eq. (1) with H:


$$S(i, j) = \sum_{(i', j') \in N_d(i, j)} H(i', j') \tag{3}$$


If the saliency value of a pixel is found to be above the threshold, then the channel
saliency scores are re-computed using the three channel histogram models. The
pixel is assigned to the type with the maximum channel saliency score.
Fig. 3 Saliency map [22]. a High exposure/bright image (raw bright frame); b Low
exposure/dark image (raw dark frame); c Saliency map of (b); d Saliency map with color labels


Following the procedure described above, most of the pixels can be filtered out by the
final saliency score, and the types of the remaining pixels can be determined by the
individual saliency models. Figure 3 shows an example of the proposed saliency model.
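
The steps of this section can be summarized in a short, pure-Python sketch. Several details are assumptions made for illustration and are not confirmed by the text: pixels are 8-bit RGB tuples, histograms are first normalized by their total count and renormalized by their peak after truncation, the neighborhood is a square window of radius d, and all function names are hypothetical.

```python
from collections import Counter

M = 32       # bins per RGB axis, as in the text
CLIP = 0.1   # truncation level for histogram values
D = 2        # neighborhood radius d
T = 0.2      # saliency threshold


def color_bin(pixel):
    """Map an 8-bit (r, g, b) pixel to its cell in the M x M x M grid."""
    return tuple(channel * M // 256 for channel in pixel)


def build_histogram(sample_pixels):
    """Normalized, truncated and renormalized color histogram (sparse dict)."""
    if not sample_pixels:
        raise ValueError("at least one sample pixel is required")
    counts = Counter(color_bin(p) for p in sample_pixels)
    total = sum(counts.values())
    clipped = {b: min(c / total, CLIP) for b, c in counts.items()}
    peak = max(clipped.values())
    return {b: v / peak for b, v in clipped.items()}


def combine_max(*histograms):
    """Bin-wise maximum over the per-state histograms, as in Eq. (2)."""
    combined = {}
    for hist in histograms:
        for b, v in hist.items():
            combined[b] = max(combined.get(b, 0.0), v)
    return combined


def saliency(image, hist, i, j, d=D):
    """Sum of histogram responses over the neighborhood of (i, j)."""
    rows, cols = len(image), len(image[0])
    score = 0.0
    for ii in range(max(0, i - d), min(rows, i + d + 1)):
        for jj in range(max(0, j - d), min(cols, j + d + 1)):
            score += hist.get(color_bin(image[ii][jj]), 0.0)
    return score


def label_pixels(image, state_hists, threshold=T):
    """Return {(i, j): state} for pixels whose combined saliency exceeds T."""
    combined = combine_max(*state_hists.values())
    labels = {}
    for i in range(len(image)):
        for j in range(len(image[0])):
            # Cheap test first: only salient pixels get per-state scores.
            if saliency(image, combined, i, j) > threshold:
                labels[(i, j)] = max(
                    state_hists,
                    key=lambda s: saliency(image, state_hists[s], i, j),
                )
    return labels
```

The design point illustrated here is that the combined histogram is evaluated once per pixel, and the three per-state scores are computed only for the few pixels that pass the threshold.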

The findContours() function in OpenCV [26] is used to extract the contours of the
blobs from the resulting binary image. Some obviously incorrect blobs can optionally
be removed based on shape analysis, e.g. area or circularity.
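
As an illustration of such shape analysis, the sketch below filters blobs by area and by the common circularity measure 4πA/P². The thresholds and function names are hypothetical; the text does not specify which measure or values are actually used, and the area and perimeter are assumed to have been measured from the extracted contours.

```python
import math


def circularity(area: float, perimeter: float) -> float:
    """4*pi*A / P^2: close to 1 for a circle, smaller for elongated shapes."""
    if perimeter <= 0:
        return 0.0
    return 4.0 * math.pi * area / perimeter ** 2


def keep_blob(area: float, perimeter: float,
              min_area: float = 4.0, min_circularity: float = 0.6) -> bool:
    """Keep a blob only if it is large enough and roughly round.

    Both thresholds are illustrative defaults, not values from the chapter.
    """
    return area >= min_area and circularity(area, perimeter) >= min_circularity
```

A round lamp-like blob passes this check, while a thin streak such as a reflection is rejected.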


2.4 Auto Exposure for Uncontrolled Illumination 


Although many researchers have studied lighting problems in computer
vision, it is still a challenge for a vision system to work robustly under varying light
conditions. As mentioned above, adjusting camera exposure could be an efficient way
to detect traffic lights. Zebra2 (2.8 MP Color GigE Vision) [27], the HDR camera used
in this chapter, provides auto-exposure. However, this function is disabled when the
high dynamic range setting is activated. TL candidate detection could be unstable
because the dynamic range of a real scene, e.g. with sunlight and skylight, may be wider
than that of the camera setting. The severe illumination changes in an uncontrolled
outdoor environment have to be considered, although the saliency map could make the
detection under large illumination variations more reliable than a simple threshold.
Auto exposure, i.e. automatically adjusting exposure parameters such as gain or shutter
speed, should help to preserve image features. Here, auto exposure for the dual channels
is considered.
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Auto exposure has been investigated in literature [28—30]. However, we need a fast 
auto exposure approach for our autonomous vehicle application. In this chapter, we 
propose a real-time auto exposure approach. The exposure is adjusted by observing 
the difference between the mean intensity of an image mask and a reference value. 

Let $I_t$ represent the expected mean image intensity and $I_c$ the mean intensity
of the current frame. A factor is defined as


$$f = \frac{I_t}{I_c} \tag{4}$$


Our objective is to let f in Eq. (4) tend to 1, i.e. $I_c$ tends to $I_t$, by updating
the gain or shutter.

To obtain the expected f, shutter and gain are jointly adjusted within their respective
ranges $[s_{min}, s_{max}]$ and $[g_{min}, g_{max}]$. In the actual implementation, the
shutter is adjusted before the gain, as noise could come together with a large gain value.
The adjustment of the shutter results in a factor:


$$f_s = \frac{s_t}{s_c} \tag{5}$$


where $s_c$ represents the current shutter value and $s_t$ represents the updated shutter
value. It is known that the shutter value is directly proportional to intensity. If the
desired factor f can be achieved by only adjusting the shutter within its range, i.e.
$f_s = f$, then no gain adjustment is needed. However, if $f_s$ cannot lead to the targeted
image intensity, then the shutter is updated to its extreme within the range, and the gain
is adjusted to cover the remaining portion of the factor, i.e. $f = f_s f_g$, where $f_g$ is
a gain factor. The remaining factor can be achieved easily by adjusting the gain based on a
common observation: when the gain is increased or decreased by approximately 6 dB,
the intensity doubles or halves.
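
The shutter-first update can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the gain is assumed to be expressed in dB, the remaining factor is converted with 20·log10(f_g) (which gives about 6 dB per doubling, matching the observation above), and the function and parameter names are hypothetical.

```python
import math


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def update_exposure(target_mean, current_mean, shutter, gain_db,
                    shutter_range, gain_range_db):
    """Return (new_shutter, new_gain_db) that move the mean intensity toward target."""
    if current_mean <= 0 or shutter <= 0:
        raise ValueError("current mean intensity and shutter must be positive")
    f = target_mean / current_mean                 # desired factor, Eq. (4)

    # Adjust the shutter first, since a large gain amplifies noise.
    s_min, s_max = shutter_range
    new_shutter = clamp(shutter * f, s_min, s_max)
    f_s = new_shutter / shutter                    # achieved shutter factor, Eq. (5)

    # The gain covers whatever part of f the shutter could not provide.
    f_g = f / f_s
    g_min, g_max = gain_range_db
    new_gain_db = clamp(gain_db + 20.0 * math.log10(f_g), g_min, g_max)
    return new_shutter, new_gain_db
```

For example, if the frame is four times too dark but the shutter can only be doubled before reaching its limit, the remaining factor of two is assigned to the gain as an increase of roughly 6 dB.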


2.5 Region of Interest (ROI)


Real-time performance is a requirement for a practical TLR system. One way to speed up
TL detection is to have some prior knowledge about the traffic lights' location in an
image. By calibrating the camera with respect to the 3D world, a region that could
contain traffic light candidates can be determined in an image.

The perspective projection of a camera, i.e. the relationship between the 3D world
and the 2D image, can be represented as a transformation matrix:
where (x, y, z) and (u, v) represent world coordinates and image coordinates,
respectively. In this chapter, the world coordinate system is defined such that the XY
plane lies on the ground and the Z-axis points upward, perpendicular to XY. The origin
of XYZ corresponds to the frontal middle point of the host vehicle; the X-axis points
towards the front of the host vehicle, and the Y-axis points towards the left so that XYZ
follows the right-hand rule. Calibration is the process of estimating the eleven unknown
parameters, a, in the 3 × 4 matrix of Eq. (6). At least four groups of 3D and 2D
coordinates are needed for this purpose. In practice, more than four groups of 3D and 2D
coordinates can be used to calibrate the camera. The eleven parameters are estimated from
these groups of data by solving a least-squares fitting problem. In this chapter, these
groups of data are obtained by reading the coordinates of a few known-size calibration objects.
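
For reference, a 3 × 4 projective transformation with eleven free parameters is commonly written in homogeneous coordinates as shown below. The scale factor s and the normalization of the last matrix entry to 1 are assumptions of this sketch (they follow the usual convention) rather than details confirmed by the text.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix},
\qquad
u = \frac{a_{11}x + a_{12}y + a_{13}z + a_{14}}{a_{31}x + a_{32}y + a_{33}z + 1},
\quad
v = \frac{a_{21}x + a_{22}y + a_{23}z + a_{24}}{a_{31}x + a_{32}y + a_{33}z + 1}.
```

Multiplying out the denominators gives, for each 3D-2D correspondence, two equations that are linear in the parameters, which is what makes a least-squares fit applicable.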

The location of the TLs in an image could be estimated when the vehicle's
localization and the 3D locations of the TLs on the map are given. However, a
high-accuracy map and localization estimation are needed for this kind of method, which
makes it hard to adopt in practical applications. In this chapter, instead of relying on
an accurate map and vehicle localization, the region of interest (ROI) method is adopted
to speed up the TL detection.

In our experiments, we have not made any assumptions about the map, the traffic lights'
location or the host vehicle's pose. When no localization information is provided,
rough ranges in the x, y and z directions, together with a very rough 2D GPS position,
are still useful. Based on these 3D ranges, a dense 3D grid can be made and a 2D image
ROI can be correspondingly generated. In other words, a ROI corresponding to the longest
distance, within which the traffic light candidates can be found, is adopted.

In this chapter, the detection range in XYZ for a vertically hung traffic light
is defined as follows. X (longitudinal): [0 m, 70 m]; Y (lateral): [-8 m, 8 m]; and
Z (upward): [2.5 m, 4 m]. These parameters are set based on typical traffic light
cases in Singapore. To estimate the possible ROI, (x, y, z) is varied within these
ranges. Figure 4g and h show two detection masks (ROIs); the mask for horizontally hung
TLs is obtained by changing the Z range to [4.5 m, 7 m]. If either the vehicle pose or
the TL locations are accessible, such an ROI could be shrunk further. More examples of
ROIs corresponding to different ranges are shown in Fig. 4a-f. Figure 4g is used as
the ROI.
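
One possible way to turn such a 3D range into an image ROI is sketched below. It assumes
a 3 x 4 projection matrix P of the kind estimated during calibration and reduces the ROI
to a bounding rectangle, which is a simplification; the masks in Fig. 4 are not
necessarily rectangular. The function name, the grid step of 0.5 m and the synthetic
camera are hypothetical.

```python
import numpy as np

def roi_bounding_box(P, x_range, y_range, z_range, image_size, step=0.5):
    """Project a dense 3D grid through P and return the pixel box (u0, v0, u1, v1)."""
    width, height = image_size
    axes = [np.linspace(lo, hi, int(round((hi - lo) / step)) + 1)
            for lo, hi in (x_range, y_range, z_range)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    uvw = np.hstack([grid, np.ones((len(grid), 1))]) @ P.T
    uvw = uvw[uvw[:, 2] > 0]                      # keep points in front of the camera
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    if not inside.any():
        return None                               # the 3D range is not visible at all
    return (int(u[inside].min()), int(v[inside].min()),
            int(u[inside].max()), int(v[inside].max()))

# Made-up camera: X forward, Y left, Z up, mounted 1.2 m high (illustrative values only).
f, cx, cy, cam_height, cam_offset = 1000.0, 800.0, 600.0, 1.2, 1.5
P = np.array([[cx, -f, 0.0, cx * cam_offset],
              [cy, 0.0, -f, f * cam_height + cy * cam_offset],
              [1.0, 0.0, 0.0, cam_offset]]) / cam_offset

# Detection range for vertically hung lights, as given in the text.
box = roi_bounding_box(P, (0.0, 70.0), (-8.0, 8.0), (2.5, 4.0), image_size=(1600, 1200))
```

For this synthetic camera the resulting box lies entirely in the upper half of the
image, which matches the intuition that lights mounted above the camera project above
the horizon.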

With the help of the ROI, the computation cost for detecting traffic light candidates
can be reduced significantly, as the experimental results discussed in Sect. 5 show.
Low computation cost is very important for real-time applications such as autonomous
vehicles, which need to run several models, e.g. perception and navigation,
simultaneously. Besides saving time, the ROI improves the traffic light recognition
accuracy, since false positives located outside the ROI are prevented.

(a) x[0, 10], y[0, 8], z[2.5, 4] (b) x[10, 20], y[0, 8], z[2.5, 4] (c) x[20, 30], y[0, 8], z[2.5, 4] (d) x[30, 40], y[0, 8], z[2.5, 4]


(e) x[40, 50], y[0, 8], z[2.5, 4] (f) x[50, 60], y[0, 8], z[2.5, 4] (g) x[0, 60], y[-8, 8], z[2.5, 4] (h) x[0, 60], y[-8, 8], z[4.5, 7]













Fig. 4 Eight ROIs for different x, y and z [22]. The ROI in (g) is used in our real experiments 


3 Traffic Light Recognition with Deep Learning 


Thanks to the availability of large datasets and progress in hardware, deep learning,
as a state-of-the-art machine learning technology, has achieved very promising results
in computer vision (e.g. object detection, data augmentation), speech recognition,
natural language processing, etc. [31]. A deep architecture can learn hierarchical
representations of the training data rather than relying on hand-crafted features.

In this chapter, we adopt deep learning to recognize the TL status from images. Similar
to other deep learning applications, the idea is that a convolutional neural network
(CNN) is able to classify a TL candidate into a TL state efficiently. We show that it is
possible to develop a real-time, high-accuracy TLR system when the deep model and its
parameters are designed carefully.

As discussed in Sect. 2, the locations of the traffic light candidates on the bright
image can be determined from their locations on the dark image, because the interval
between successive bright/dark frames is very short and can be neglected. We first
discuss the correspondence between the bright and dark channels in the next section.


3.1 Dual Channel Mechanism 


The two images captured via the low-exposure and high-exposure channels are not
synchronized, i.e. they are not captured simultaneously, although the interval between
the two channels’ timestamps is very short. The vehicle’s motion makes it hard to align
a TL candidate detected in the dark image with the bright image, especially when the
vehicle’s vibration (due to movement) cannot be ignored. Hence, a way is needed to
re-locate the TL candidate detection results on the bright image.

With the detected candidates on the dark frame, we aim to find the corresponding
regions on the next bright frame, which has richer texture. Because of the time
interval between consecutive frames, a new region center on the bright image may be
needed to ensure that the regions cropped from the bright image correspond to the TL
candidates. In this chapter, the center position, p, and radius, r, of each TL candidate
are used to estimate the new center on the following bright frame. In detail, the new
center is searched within a window, 12r x 12r in this chapter, centered at p. The center
of a TL candidate normally has the highest brightness value and color variance among the
pixels within the window. For RGB space, the brightness image, I, is computed as


I = 0.2126 R + 0.7152 G + 0.0722 B    (7)

and the variance image, V, is computed as

V = |R - I| + |G - I| + |B - I|    (8)


A new center can be found as the highest response in a weighted sum image
[32, 33]:


\alpha V + (1 - \alpha) I    (9)


where \alpha is a weight. As the brightness changes significantly when the lighting
conditions change, \alpha is set to 0.7 in this chapter. As mentioned above, a 12r x 12r
window centered at each new center is cropped from the bright frame and used as a
candidate region for TL state classification.
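
The re-centering step of Eqs. (7)-(9) can be sketched as below. This is a minimal
illustration rather than the authors' implementation: the function name is hypothetical,
the image is assumed to be an RGB array indexed as [row, column, channel], the luma
weights are the standard Rec. 709 values, and the search window is clipped at the image
border.

```python
import numpy as np

def relocate_center(bright_rgb, p, r, alpha=0.7):
    """Return the (x, y) pixel with the highest alpha*V + (1 - alpha)*I
    inside a 12r x 12r window centered at p = (x, y)."""
    img = np.asarray(bright_rgb, dtype=float)
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    I = 0.2126 * R + 0.7152 * G + 0.0722 * B           # brightness image, Eq. (7)
    V = np.abs(R - I) + np.abs(G - I) + np.abs(B - I)  # variance image, Eq. (8)
    score = alpha * V + (1 - alpha) * I                # weighted sum image, Eq. (9)

    height, width = I.shape
    x, y = p
    half = 6 * r                                       # half of the 12r window side
    x0, x1 = max(x - half, 0), min(x + half + 1, width)
    y0, y1 = max(y - half, 0), min(y + half + 1, height)
    window = score[y0:y1, x0:x1]
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return x0 + int(dx), y0 + int(dy)

# Toy frame: uniform gray background with one saturated red pixel at x=55, y=40.
frame = np.full((100, 100, 3), 50.0)
frame[40, 55] = (255.0, 0.0, 0.0)
new_center = relocate_center(frame, p=(50, 45), r=2)
```

In the toy frame, the candidate detected at (50, 45) on the dark image is moved to the
saturated red pixel, because a strongly colored pixel scores much higher in V than the
gray background.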


3.2 Customized Convolutional Neural Network 


False positives are still possible, e.g. caused by the brake lights of the vehicles
ahead or by other objects with a color similar to a TL, although most of them can be
removed during the TL candidate detection stage. In order to improve robustness
further, a CNN classifier is applied to separate true positives from false positives.

Accuracy and speed are the two considerations in selecting a CNN classifier. As the
classifier is one of the perception models running in the autonomous vehicle, running
speed is an important issue, because the classifier shares limited resources with other
models, e.g. object detection, lane detection, etc. CaffeNet [34], see Fig. 6, is a
1-GPU version of AlexNet [35] in which the two paths of AlexNet are combined into one
path. A customized CNN model, similar to CaffeNet [34], is adopted in this chapter. The
number of outputs of the last layer is set to 13, i.e. twelve positive classes and one
background class. The positive classes are defined based on the possible traffic light
types.


(1) HARL Horizontally Aligned Red Light
(2) VARL Vertically Aligned Red Light
(3) HAGL Horizontally Aligned Green Light
(4) VAGL Vertically Aligned Green Light
(5) LVL Left Vehicle Light
(6) RVL Right Vehicle Light
(7) GAL Green Arrow Light
(8) RAL Red Arrow Light
(9) AL Amber Light
(10) GPL Green Pedestrian Light
(11) RPL Red Pedestrian Light
(12) OFRL Other Fake Red Light


Figure 5 shows the annotation results on an image where the true positives are 
labeled in blue and false positives are labeled in red. 

The within-class variance can be reduced by the above definitions. For example, by
defining horizontal and vertical lights, red traffic lights can be distinguished from
false positives, e.g. LVL, RVL and OFRL. Another advantage of these definitions is
the reduced effort in data collection and annotation.

CaffeNet contains eight layers. The first five layers are convolutional layers, each of
which transforms one volume of activations into another through a differentiable
function. A convolutional layer’s parameters consist of a set of learnable filters
(convolutional kernels). Each filter is small spatially (along width and height), but
extends through the full depth of the input volume. Two processes, max pooling and local
response normalization, are added in the first and second layers. The last three layers
are fully connected layers. According to the architecture in Fig. 6, there are 60 million
parameters to be trained.
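
The figure of roughly 60 million parameters can be checked with a short calculation.
The layer shapes below are those of the publicly documented reference CaffeNet/AlexNet
(227 x 227 input, grouped convolutions in layers 2, 4 and 5); they are an assumption
here, since the customized model of this chapter, which works on 111 x 111 samples, may
use different shapes. The helper names are hypothetical.

```python
def conv_params(out_channels, in_channels, kernel, groups=1):
    """Weights plus biases of a (possibly grouped) convolutional layer."""
    return out_channels * (in_channels // groups) * kernel * kernel + out_channels

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def caffenet_params(num_classes):
    """Parameter count of the reference CaffeNet with a num_classes-way output layer."""
    conv = (conv_params(96, 3, 11)
            + conv_params(256, 96, 5, groups=2)
            + conv_params(384, 256, 3)
            + conv_params(384, 384, 3, groups=2)
            + conv_params(256, 384, 3, groups=2))
    fc = (fc_params(256 * 6 * 6, 4096)
          + fc_params(4096, 4096)
          + fc_params(4096, num_classes))
    return conv + fc

imagenet_total = caffenet_params(1000)   # original 1000-class output layer
tl_total = caffenet_params(13)           # 13-way output layer, reference shapes otherwise
```

Under these assumptions the 1000-class network has 60,965,224 parameters, almost all
of them in the three fully connected layers; replacing the output layer by a 13-way
layer removes about four million of them.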

The CNN classifier’s weights are trained with a fine-tuning strategy. The parameters
needed in the training procedure are set based on experiments. In our approach, the
base learning rate and decay of the CNN model are set to 0.001 and 0.0005,
respectively. For the modified layers, i.e. the first convolutional layer and the output
layer, the learning-rate multipliers are set to 10 in the first 2000 iterations; for the
other layers, the multiplier is 1. In total, 50000 iterations are taken in the training
procedure.


Fig. 5 Annotated TL samples on an image [22]

Fig. 6 The architecture of CaffeNet


4 Temporal Trajectory Analysis 


In this chapter, instead of using conventional tracking techniques, such as the Kalman
filter or particle filter, we propose a simple but efficient temporal trajectory
analysis for improving the accuracy and robustness of traffic light recognition.

As mentioned in Sect. 2.4, an HDR camera, Zebra2 [27], is used in this chapter.
This high-speed camera ensures that the targets found on the images change continuously
from frame to frame. In this scenario, temporal-spatial analysis, a process that
examines whether a target detected in the current frame has been found in nearly the
same area of the previous frames, can be used to track the targets. Traffic is
controlled by keeping the traffic light status unchanged for a certain period of time.
As a result, the regions of the light are spatially continuous over the image sequence,
regardless of whether the vehicle is moving or stationary. Based on these observations,
traffic light recognition can benefit from proper temporal-spatial tracking in two
respects: (1) smoothness is improved, as missing or low-confidence traffic light
statuses can be filled in; (2) isolated false positives can be removed.

We define trajectory as the history of a traffic light instance. A trajectory consists 
of several components: 


(1) type;
(2) history locations;
(3) lifetime;
(4) discontinuity.


Trajectories are grouped according to light status and stability. Hence, six types
of trajectories are defined:


(1) stable red;
(2) stable green;
(3) stable amber;
(4) temporary red;
(5) temporary green;
(6) temporary amber.


For example, “stable red” means that the trajectory is confirmed as a red traffic light
status (either a horizontal or a vertical red light). Another item, lifetime, is the
period of the trajectory since its first instance was detected. An example of the
trajectory analysis is shown in Fig. 7, where the trajectory of a vertical green light
is plotted onto one frame.


Fig. 7 The trajectory of a vertical green light (marked as dots in green) is plotted
onto one frame [22]

The trajectory pool is updated continuously. At the very beginning, once a new
target is found, a temporary trajectory is initialized for it. To promote a temporary
trajectory to a stable trajectory, a minimal lifetime (one second in our experiments)
and a minimal number of instances (five in our experiments) are required. It should
be noted that these parameters are set based on our experiments. A trajectory is
deleted from the pool if its lifetime is longer than a threshold (seventy seconds in our
experiments). In the current traffic light control system, the lifetime of red, green or
amber lights is normally below this threshold. If a red light occasionally lasts longer
than this threshold, its history can simply be split into two trajectories. When a new
frame is given, traffic lights are detected and the new targets are added to the pool.
Assume a red light status is recognized in a frame; the trajectories in the pool having
red light are then checked. The new location is added to a trajectory if its distance to
that trajectory is the minimum among the red trajectories in the pool and is below a
pre-defined threshold (sixty pixels in our experiments). If no such trajectory can be
found, a new temporary red trajectory is created starting from this new target. The new
target becomes a stable red light if a stable trajectory is found. Otherwise, the new
target (which could be a false positive) is recorded as a temporary red light.
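
The update rule above can be sketched as follows, using the thresholds quoted in the
text. Everything else is an assumption made for illustration: the class and function
names are hypothetical, the distance to a trajectory is taken as the distance to its
most recent location, and the discontinuity component is left out.

```python
from dataclasses import dataclass, field
from math import dist

MIN_LIFETIME_S = 1.0    # minimal lifetime before a trajectory becomes stable
MIN_INSTANCES = 5       # minimal number of instances before it becomes stable
MAX_LIFETIME_S = 70.0   # trajectories older than this are deleted
MAX_DIST_PX = 60.0      # association threshold in pixels

@dataclass
class Trajectory:
    color: str                                     # "red", "green" or "amber"
    locations: list = field(default_factory=list)  # history of (x, y) positions
    times: list = field(default_factory=list)      # detection timestamps in seconds
    stable: bool = False

def update_pool(pool, color, xy, t):
    """Assign one detection to the pool and return 'stable' or 'temporary'."""
    # Delete trajectories whose lifetime exceeds the threshold.
    pool[:] = [tr for tr in pool if t - tr.times[0] <= MAX_LIFETIME_S]
    # Nearest trajectory of the same color (distance to its latest location).
    same_color = [tr for tr in pool if tr.color == color]
    best = min(same_color, key=lambda tr: dist(tr.locations[-1], xy), default=None)
    if best is None or dist(best.locations[-1], xy) > MAX_DIST_PX:
        best = Trajectory(color)                   # start a new temporary trajectory
        pool.append(best)
    best.locations.append(xy)
    best.times.append(t)
    lifetime = best.times[-1] - best.times[0]
    if lifetime >= MIN_LIFETIME_S and len(best.locations) >= MIN_INSTANCES:
        best.stable = True
    return "stable" if best.stable else "temporary"

# Six red detections, 0.25 s apart, drifting slowly across the image.
pool = []
states = [update_pool(pool, "red", (100 + 2 * i, 50), 0.25 * i) for i in range(6)]
far_state = update_pool(pool, "red", (500, 400), 1.5)   # isolated detection far away
```

In this toy run the first four detections are reported as temporary, the fifth and
sixth as stable, and the isolated detection far from the trajectory starts a second,
temporary trajectory.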


5 Experimental Results


In this section, a quantitative analysis (in terms of precision and recall) of the
proposed method is conducted on a large database. A comparison with the state of the
art is provided to show the advantages of our approach. The experiments on both the
database and real roads show that our TLR method satisfies the speed and accuracy
requirements of an autonomous vehicle.


5.1 Evaluation of Performance 


Precision and recall are standard performance measures. They are defined in Eqs. (10)
and (11):


Precision = \frac{TP}{TP + FP}    (10)

Recall = \frac{TP}{TP + FN}    (11)


where TP represents the number of true positive samples, FP the number of false
positive samples and FN the number of false negative samples.

For this quantitative analysis, a large database has been collected using our
autonomous vehicle. The numbers of true positives, false positives and false negatives
are counted, and precision and recall are computed with Eqs. (10) and (11).
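
As a small illustration of Eqs. (10) and (11), the two measures can be computed from
the three counts as follows. The counts in the example are made up and do not come
from the chapter's tables.

```python
def precision(tp, fp):
    """Eq. (10): fraction of reported positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (11): fraction of actual positives that are found."""
    return tp / (tp + fn)

# Hypothetical counts for one class.
tp, fp, fn = 95, 5, 3
p = precision(tp, fp)
r = recall(tp, fn)
```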

The database contains 4,142 images. The images are selected such that each class
contains nearly the same number of samples. A total of 21,070 boxes were manually
annotated on these images, i.e. about 1,750 boxes for each class. The training set
consists of 3,722 images (about 90%) and the evaluation set contains 420 images.
In order to train the network, we generate about one million samples from the above
seed samples. Scaling and translation are adopted for generating new samples from each
seed sample. Although there is no special requirement on the generation of the training
samples, we found during the experiments that the performance is affected by the
balance among the numbers of samples for each class. The generation of the new samples
is as follows.

In this chapter, the resolution of the original image is 1600 x 1200. A uniform
random distribution is adopted for shifting and scaling the original TL region to
generate new training samples. The region center of the traffic light candidate is
shifted by -0.2 to 0.2 times the candidate rectangle’s width or height, and the region
is then scaled by a factor of 1 to 1.2. The new samples are finally resized to 111 x 111.
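
The sample generation described above might look like the following sketch. The shift
and scale ranges and the 111 x 111 output size come from the text; the function names,
the (center x, center y, width, height) box convention and the nearest-neighbor
resampling are assumptions made to keep the example self-contained.

```python
import numpy as np

def augment_box(box, rng, max_shift=0.2, max_scale=1.2):
    """Randomly shift the box center and enlarge the box, both drawn uniformly."""
    cx, cy, w, h = box
    cx = cx + rng.uniform(-max_shift, max_shift) * w
    cy = cy + rng.uniform(-max_shift, max_shift) * h
    scale = rng.uniform(1.0, max_scale)
    return cx, cy, w * scale, h * scale

def crop_and_resize(image, box, out_size=111):
    """Crop the box from the image and resample it to out_size x out_size
    with nearest-neighbor lookup (coordinates are clipped at the image border)."""
    cx, cy, w, h = box
    height, width = image.shape[:2]
    xs = np.round(np.linspace(cx - w / 2, cx + w / 2, out_size)).astype(int)
    ys = np.round(np.linspace(cy - h / 2, cy + h / 2, out_size)).astype(int)
    xs = np.clip(xs, 0, width - 1)
    ys = np.clip(ys, 0, height - 1)
    return image[np.ix_(ys, xs)]

rng = np.random.default_rng(0)
image = np.zeros((1200, 1600, 3), dtype=np.uint8)   # placeholder 1600 x 1200 frame
seed_box = (800.0, 600.0, 40.0, 60.0)               # hypothetical annotated TL box
new_box = augment_box(seed_box, rng)
patch = crop_and_resize(image, new_box)
```

Calling `augment_box` repeatedly on each seed box, with a per-class repetition count
chosen to equalize the classes, is one way to obtain the balanced training set mentioned
in the text.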

To evaluate our system, the algorithm is run over 63 new video sequences (each
about 4 min long). The testing videos contain samples under different conditions:
daytime, weather, expressway and urban road. From these video sequences, 1,800 images
are selected at a sampling interval of 80 frames. The ground truth from these images
contains 5,229 samples. When the ROI is considered, there are 3,099 samples.

Table 1 gives the experimental results, where the test results with the ROI are
recorded in brackets.

We can see from Table 1 that the recognition accuracy for vehicle signal lights is
worse than that of the other classes. One reason could be that there are not enough
vehicle signal light samples compared with the traffic light samples. Another reason
could be that there are many more types of vehicle lights than of the other classes.
More training samples, covering more kinds of vehicle lights, have to be collected to
improve the vehicle signal light recognition accuracy. In any case, by applying the ROI
discussed in Sect. 2.5, most of the vehicle lights are removed from the results
because they are in the lower part of the images (see Table 1).

Based on the results given in Table 1, the precision and recall are obtained and
listed in Table 2. From Table 2, we can see that, when the ROI is applied, the average
recall improves from 98.04% to 99.03% and the average precision from 97.45% to 98.91%.
The detection rate of traffic light candidates is computed as follows.


D=(TP+FN)/G (12) 


where G represents the number of the ground truth. 

The ground truth for each class is given in the first row of Table 3. Based on
Table 1 and Eq. (12), the detection rates are computed and listed in Table 3. The
average detection rate improves from 96% to 97.85% when the ROI is applied, see
Table 3.


5.2. Comparison with State of the Art Algorithms 


In order to demonstrate the advantages of our approach, its performance is compared
with that of the state of the art. To the best of our knowledge, no publicly available
HDR TLR benchmark database exists for such a comparison. Most published TLR systems
are evaluated on the authors’ own databases, collected using a single color camera. One
exception we could find in the literature is [17], which uses multiple-exposure images.
Experiments were conducted on several urban scenes, but no accuracy was reported.

Nevertheless, the performance of our approach is compared in this chapter with that of
a state-of-the-art deep learning object detection approach. For this purpose, results
obtained by using only the high-exposure images of our test data are compared. To make
the comparison fair, we use the same training database as for our TLR to re-train the
state-of-the-art deep learning detector.


Table 1 HDR: Confusion matrix without/with ROI; the results with the ROI are recorded in brackets [22]


Table 2 The precision and recall without/with ROI; the results with ROI are recorded in brackets [22]


Table 3 HDR: Detection rate without/with ROI; the test results with ROI are recorded in brackets [22]


For an autonomous vehicle application, an object detector is selected based on two
criteria: (1) real-time operation; (2) high accuracy. Although there are several deep
learning object detectors, like Faster R-CNN [36], most of them cannot run in real time.
In this chapter, we adopt You Only Look Once (YOLO) [37] as the state of the art to
be compared with our approach. YOLO can run in real time and its new version
(YOLOv2) [38] has achieved better performance than others, like Faster R-CNN
[36] and the single shot multibox detector (SSD) [39].

The architecture of YOLOv1 is shown in Fig. 8. It has 24 convolutional layers
followed by 2 fully connected layers. Alternating 1 x 1 convolutional layers reduce
the feature space from the preceding layers. The convolutional layers are pretrained on
the ImageNet classification task at half the resolution (224 x 224 input image), and
the resolution is then doubled for detection. The final output of YOLO is a 7 x 7 x 30
tensor of predictions. YOLO looks at the whole image at test time, so its predictions
are informed by the global context in the image. It also makes predictions with a single
network evaluation, unlike systems like R-CNN which require thousands of evaluations for
a single image. This makes YOLO extremely fast: more than 1000 times faster than R-CNN
and 100 times faster than Fast R-CNN.
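
The 7 x 7 x 30 output size follows from the YOLOv1 design as described in [37]: the
image is divided into an S x S grid, and each cell predicts B boxes (four coordinates
plus one confidence each) together with C class probabilities. The values S = 7, B = 2
and C = 20 used below are the PASCAL VOC configuration of the original YOLO paper, not
a setting reported in this chapter, and the function name is hypothetical.

```python
def yolo_v1_output_shape(grid=7, boxes_per_cell=2, num_classes=20):
    """Shape of the YOLOv1 prediction tensor: S x S x (B * 5 + C)."""
    return (grid, grid, boxes_per_cell * 5 + num_classes)

voc_shape = yolo_v1_output_shape()                 # configuration of the YOLO paper
tl_shape = yolo_v1_output_shape(num_classes=12)    # e.g. twelve traffic light classes
```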

Some experimental results obtained by HDR and YOLOv2 are shown in Fig. 9a
and b, respectively. The test results obtained with YOLOv2, presented in the same way
as for HDR in the last section, are shown in Tables 4, 5 and 6.

The true precision and recall are calculated by taking the detection rate into
account, i.e. by multiplying the precision or recall rate by the detection rate,
respectively. Table 7 lists the true precision and recall rates for YOLOv2 and
our approach, where the rates with ROI are recorded in brackets. Better performance
is achieved by our approach, either with or without ROI, compared with that of
YOLOv2.


Fig. 8 The architecture of YOLO [37]. YOLO has 24 convolutional layers followed by 2 fully
connected layers. Alternating 1 x 1 convolutional layers reduce the feature space from the
preceding layers. The convolutional layers are pretrained on the ImageNet classification
task at half the resolution (224 x 224 input image), and the resolution is then doubled for
detection. The final output of YOLO is a 7 x 7 x 30 tensor of predictions


Fig. 9 Some experimental results obtained by a HDR b YOLOv2 [22]


Table 7 Comparison of the precision and recall between HDR and YOLOv2; the results with ROI
are recorded in brackets [22]

Recall (%) 94.1 (96.9)
Precision (%) 93.6 (96.8)


In detail, with ROI, the precision improves from 92.5% to 96.8% and the recall
from 94.3% to 96.9%.
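
The multiplication behind the true rates can be reproduced from the averages quoted in
Sect. 5.1 (recall 98.04% and 99.03%, precision 97.45% and 98.91%, detection rate 96% and
97.85%, each without and with ROI). The short calculation below, with a hypothetical
helper name, yields the recall and precision values that appear in Table 7.

```python
def true_rate(rate_percent, detection_rate_percent):
    """Precision or recall (in %) weighted by the candidate detection rate (in %)."""
    return rate_percent * detection_rate_percent / 100.0

# (without ROI, with ROI), using the averages quoted in Sect. 5.1.
true_recall = (true_rate(98.04, 96.0), true_rate(99.03, 97.85))
true_precision = (true_rate(97.45, 96.0), true_rate(98.91, 97.85))
rounded = [round(x, 1) for x in (*true_recall, *true_precision)]
```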

The use of the dark channel for detecting traffic light candidates is an efficient way
to prevent many false positives caused by traffic signs, sunlight, pedestrians’ clothes,
etc., because the corresponding regions on the dark image are barely visible.
Figures 10, 11 and 12 give a few examples showing that some false positives detected
when only a single color camera is used can be prevented by our dual-channel approach.
In Fig. 10, the YOLOv2 results contain two false positives: the reflection of the
traffic light on the bus body and sunlight on the building. Our dual-channel approach
prevents these two false positives because there is no response for them in the dark
image in Fig. 10.





Fig. 10 YOLOv2 versus HDR [22]. Left: YOLOv2, two false positives (in red circles) caused by 
the reflection of the traffic light on the bus body and the sunlight on the building, respectively; 
Middle: HDR, the two false positives in YOLOv2 are prevented because there is no response for 
these two false positives in the dark image (right) 





Fig. 11 YOLOv2 versus HDR [22]. Left: YOLOv2, one false positive (in red circle); Middle:
HDR, the false positive in YOLOv2 is prevented in HDR because there is no response for this
false positive in the dark image (right)


Fig. 12 YOLOv2 (top) versus HDR (bottom) [22]. The false positives in the top caused by
traffic sign and pedestrian can be prevented in the bottom


Table 8 Comparison of computation cost (for one frame) between HDR and YOLOv2 [22].
Columns: With ROI (ms), Without ROI (ms), Time saving; rows: YOLOv2, HDR

The results obtained on the same dataset show that the approach proposed in this
chapter is better than the state-of-the-art technique in terms of both speed and
accuracy.

The dual-channel algorithms presented in this chapter are implemented in C++ on
a Mini-PC (GIGABYTE, NVIDIA GeForce GTX 760), and run at about 30-40 fps depending on
the number of traffic lights in an image. The processing time is reduced significantly
by using the ROI, see Table 8. In contrast, YOLOv2 does not save time even when the ROI
is used, because the network requires an input image of fixed size.

Our method has been demonstrated on real roads using the A*STAR I2R AV [40] via
Data Distribution Service (DDS) [41]. One of the test results is shown in Fig. 13,
where a few frames of a video at ten-frame intervals are provided. Interested readers
may refer to [22] or to the following link for more video demos:

https://www.youtube.com/watch?v=HQqaAvuJI_I 


6 HDR Imaging Vehicle Signal Recognition 


Vehicle following is one of the fundamental functions of an autonomous vehicle. It is
important to detect and recognize tail light signals to prevent an autonomous vehicle
from causing rear-end collisions or accidents. Although sensors like acoustic sonar or
commercial Advanced Driving Assistance System (ADAS) products such as Mobileye could be
used for rear-end collision warning, a cost-effective approach is desirable.

Encouraged by the good performance in TLR, we have extended the dual-channel
method presented in this chapter to Vehicle Signal Recognition (VSR) [20, 23]. The
vehicle signal light, similar to the traffic light, can be recognized robustly from the
dark channel, where the lights appear against a clean background. Similar to TLR, our
VSR is a two-stage approach: vehicles are detected from bright images using a deep
learning detector, and the signal light is then recognized from dark images using a CNN.
Unlike previous vehicle signal recognition approaches, in which a pair of taillights has
to be extracted explicitly, we use the vehicle appearance image instead.

Due to the length limitation, only Brake Light Recognition (BLR) is discussed
in this chapter. Other signal lights, e.g. left turn or right turn, can be recognized in
a way similar to brake light recognition, although video sequence analysis, e.g. with
LSTM [42], rather than a single image is needed.


6.1 Related Approaches


Vehicle signal recognition has been explored by various approaches. The existing
approaches can be roughly divided into two categories: those using temporal information
[43-48] and those using a single image [49-52]. Most of them detect signal lights using
red color features and then try to pair the taillights using their symmetry. Cui et al.
[49] proposed a hierarchical approach to detect vehicles and taillights in the daytime.
They adopted the Deformable Part Model (DPM) [53] to detect vehicles in images. The red
light candidates are then found by clustering pixels in HSV color space within the
bounding boxes of the candidates. After pairing the taillights based on prior knowledge
about vehicle appearance, a sparse dictionary is learned to classify the signal lights.
The approach is hard to apply to autonomous vehicles because of its slow processing
speed, occlusion and possible false positives. Besides the slow detection of DPM, a
serious problem of this approach is that the taillights could be occluded, which causes
the taillight pairing to fail. In addition, noise from the urban road environment, e.g.
traffic lights and streetlights, could affect the detection of the taillights.


6.2 Two-Stage Vehicle Signal Recognition 


The use of our dual-channel mechanism makes it possible to separate detection and
recognition into two stages that run on images of different exposures. Similar to the
TLR described above, we separate VSR into two stages: vehicle detection and signal
light recognition. The first stage is executed on the high-exposure (bright) image
and the second stage on the low-exposure (dark) image.


6.3 Vehicle Detection 


Although deep learning object detection has achieved very promising results, only
a few detectors can run in real time. Faster R-CNN [36], one of the state-of-the-art
object detectors, reaches a frame rate of only 5 frames per second when based on the
very deep VGG-16 model [54], which is far from the AV requirements (at least 10 frames
per second is required, because several perception modules share the computing
resources). YOLO [37] and SSD [39] are examples of object detectors that can run in real
time. Detection is formulated as a regression problem in YOLO. As it accesses the image
only once, a fast frame rate is achieved. In this chapter, consistent with the
comparison made in Sect. 5.2, YOLOv2 [38] is adopted as the state-of-the-art object
detector to detect vehicles in a single image. As mentioned in Sect. 5.2, YOLOv2 [38]
has achieved better performance than others, like Faster R-CNN [36] and SSD [39].


Fig. 14 Vehicle detection with YOLOv2


We retrain the YOLOv2 detector with our database collected on real roads using
our autonomous vehicle. 21,000 images are annotated, resulting in more than 100,000
vehicle samples (about four to five vehicles in each image). We define eight classes
of objects: (1) car; (2) truck; (3) lorry; (4) van; (5) bus; (6) motorcycle; (7) bicycle;
(8) pedestrian. Two examples of the vehicle detection results are shown in Fig. 14.


6.4 Brake Light Pattern and Recognition 


Unlike existing BLR methods, which require the left and right taillights to be
extracted explicitly, appearance-based deep learning is proposed in this chapter to
recognize brake lights. In other words, the regions within the bounding boxes detected
by the YOLOv2 detector, which we call Brake-Light Patterns (BLPs) in this chapter, are
used directly to recognize the brake light.
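
The data flow of this two-stage scheme can be sketched as follows. The detector and the
classifier are passed in as callables and are replaced here by trivial stand-ins, since
the retrained YOLOv2 detector and the CNN classifier are not reproduced; all names are
hypothetical, and the sketch assumes that a bounding box found on the bright image can
be applied directly to the dark image.

```python
import numpy as np

def recognize_brake_lights(bright, dark, detect_vehicles, classify_blp):
    """Stage 1: detect vehicles on the bright image.
    Stage 2: classify the Brake-Light Pattern cropped from the dark image."""
    results = []
    for (x0, y0, x1, y1) in detect_vehicles(bright):
        blp = dark[y0:y1, x0:x1]                  # BLP: same box, dark channel
        results.append(((x0, y0, x1, y1), classify_blp(blp)))
    return results

# Stand-ins for the real models (placeholders, not the chapter's detector or CNN).
def fake_detector(bright):
    return [(20, 40, 70, 90), (100, 40, 150, 90)]

def fake_classifier(blp):
    return "braking" if blp[..., 0].mean() > 10 else "normal"

bright = np.full((120, 160, 3), 128, dtype=np.uint8)
dark = np.zeros((120, 160, 3), dtype=np.uint8)
dark[60:70, 30:50, 0] = 255                       # lit red region inside the first box
results = recognize_brake_lights(bright, dark, fake_detector, fake_classifier)
```

Because the whole box is passed to the classifier, no taillight pairing step appears
anywhere in the pipeline.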

State-of-the-art performance has been achieved by deep learning on a number
of image recognition benchmark databases, e.g. ILSVRC-2012 [55]. Similar to the
TLR presented in the previous sections, we argue that the brake light can be learned
better from dark images than from bright images. Besides the clean background of dark
images making the light recognition robust, the occlusion problem can be overcome to
some extent by using the BLPs proposed in this chapter, see Fig. 15, rather than a pair
of taillights. Furthermore, the middle brake light included in the BLP, which in most
cases is located at the rear window of the vehicle, makes the recognition more
reliable than using only the left and right taillights. Previous approaches do not
use this middle light because it is darker than the left and right taillights and
therefore hard to extract.

An example is shown in Fig. 15. The BLPs of a vehicle, corresponding to the
bounding boxes on the left (bright images), are shown on the right (dark images). The
brake light can be recognized accurately from dark images because the difference
between “braking” and “normal” on a dark image is much larger than that on the bright
counterpart.





Fig. 15 The Brake-Light Pattern (BLP) of a vehicle (right, dark image) corresponding to the
vehicle detection bounding box shown on the left (bright image)


6.5 Experimental Results 


Similar to TLR, no public benchmark, especially no HDR benchmark, is available for
brake light recognition. Most brake light recognition systems use bright images only.
Nevertheless, in this chapter, a quantitative analysis of our proposed dual-channel
method has been carried out on our own database. A comparison with the approach that
uses only bright images is provided to show the advantages of our approach.

The same videos used in the TLR, see Sect. 3, have been used to train and evaluate
brake light recognition. As the vehicle detection results are the same for the previous
approach and our approach (same bright image and same detector), the brake light
recognition results, using the bright image in previous approaches and the dark image in
our approach, are compared.

The ground truth (“normal” or “braking”) comes from 1,001 images containing
2,123 samples. In order to train the network, we generate about one million samples
from the above seed samples in the same way as described in Sect. 5.1. Some training
samples are shown in Fig. 16. The images with “normal” and “braking” patterns are
shown in the top and bottom rows, respectively.


Fig. 16 Some training samples generated from seed images for “normal” (top row) and “braking”
(bottom row)





Ten-fold evaluation is adopted to test the accuracy. The results are listed in
Table 9.

The average accuracy of our method is found to be 97.5%, much better than the 89%
obtained by previous approaches using the bright image. The vehicle detection
rate is found to be 99.5%.

Figures 17 and 18 show two examples of the brake light recognition experiments. The
bounding box of a vehicle is marked in green or red when it is identified as “normal”
or “braking”, respectively. The method can handle partial occlusion (Fig. 18)
because a pattern rather than a pair of lights is used.


Table 9 Comparison of the previous approaches (bright image) and our approach (dual channel) 


               Previous approaches (bright image only)   Our approach (dual channel) 

Accuracy (%)   89                                         97.5 








Fig. 17 Brake light recognition results [20]. Left: “normal” (green); right: “braking” (red) 
Fig. 18 Brake light recognition under partial occlusion conditions [20]. The brake status of the 
vehicle on the right can be recognized even though its right tail light is fully occluded 





Like the TLR system presented in Sect. 5, the VSR algorithms developed in this 
chapter have been integrated into our autonomous vehicle, the A*STAR I2R AV [40]. 
Demonstrations on real roads, including vehicle following and obstacle avoidance, 
have shown that both the accuracy and the speed satisfy the requirements of the 
autonomous vehicle. Running VSR and TLR together on the PC presented in 
Sect. 5.2, i.e. a Mini-PC (GIGABYTE, 2.5 GHz CPU, GTX 760), we achieve 25-35 fps, 
depending on the number of traffic lights and vehicle signal lights in an image. 


7 Conclusion and Future Work 


A real-time TLR system has been proposed in this chapter to detect and recognize TLs 
based on high dynamic range imaging and deep learning. The advantages of an HDR 
camera, i.e. multiple exposure images, are fully used. The drawback of the state of 
the art, which uses only bright images and is therefore prone to false positives, is 
overcome by our approach because the low exposure image has a clean (dark) background 
that ensures the TL can be detected reliably. Furthermore, the candidates 
on the high exposure image, corresponding to those on the dark image, can be 
recognized with high accuracy because rich context is available. The number of 
TL candidates to be identified by the CNN is significantly reduced by using a saliency 
map and ROI. This makes the system fast as well as robust to noise, e.g. vehicles’ tail 
lights. Finally, the accuracy and reliability are further improved by tracking. By 
executing the method on a large database collected from real roads, 
we have shown that our method performs better than the state of the 
art. Encouraged by the good performance of the TLR, we extended our dual-channel 
method to VSR. Vehicles are detected from bright images and the vehicle signal 
lights are recognized from the counterpart dark images. As with TLR, good VSR 
performance has been achieved. The online tests on our autonomous vehicle have 
been completed successfully, verifying that our method satisfies the speed and accuracy 
requirements of an autonomous vehicle. 
Using both dark and bright images as input to the CNN could be investigated in the 
near future, and the performance at night could be evaluated quantitatively. During 
the tests on real roads, we observed that our dual-channel method is feasible at 
night, because night conditions have little effect when traffic lights are detected 
from dark images. Properly adjusted camera parameters and a CNN model re-trained with 
night data should be sufficient for night operation. Lastly, the RNDF (Route Network 
Definition File) could be adopted in the future to locate traffic lights. Fusing the 
detections with RNDF information should eliminate many of the false positives. 


Acknowledgements We have benefited enormously from ideas and discussions with our 
ex-colleagues: Yu Pan, Serin Lee, Zhi-Wei Song, Boon-Siew Han and Vincensius-Billy Saputra. 


References 


1. Jensen, M.B., Philipsen, M.P., Trivedi, M., Mogelmose, A., Moeslund, T.: Vision for look- 
ing at traffic lights: issues, survey, and perspectives. IEEE Trans. Intell. Transp. Syst. 17(7), 
1800-1815 (2016) 

2. Diaz, M., Pirlo, G., Ferrer, M.A., Impedovo, D.: A survey on traffic light detection. In: 
Proceedings of ICIAP 2015 workshops on New Trends in Image Analysis and Processing, Lecture 
Notes in Computer Science, vol. 9281, pp. 201-208 (2015) 

3. Philipsen, M.P., Jensen, M.B., Mogelmose, T., Moeslund, T.B., Trivedi, M.M.: Ongoing work 
on traffic lights: detection and evaluation. In: Proceedings of 12th IEEE International Confer- 
ence on Advanced Video and Signal Based Surveillance (AVSS) (2015) 

4. Gong, J., Jiang, Y., Xiong, G., Guan, C., Tao, G., Chen, H.: The recognition and tracking 
of traffic lights based on color segmentation and CAMSHIFT for intelligent vehicles. In: 
Proceedings of IEEE Intelligent Vehicle Symposium (2010) 

5. Siogkas, G., Skodras, E., Dermatas, E.: Traffic lights detection in adverse conditions using 
color, symmetry and spatiotemporal information. In: Proceedings of International Conference 
on Computer Vision Theory and Applications, pp. 620-627 (2012) 

6. Charette, R., Nashashibi, F.: Traffic light recognition using image processing compared to 
learning processes. In: Proceedings of IEEE/RSJ International Conference on Robots and 
Systems, pp. 333-338 (2009) 

7. Diaz-Cabrera, M., Cerri, P., Sanchez-Medina, J.: Suspended traffic lights detection and distance 
estimation using color features. In: Proceedings IEEE International Conference on Intelligent 
Transportation Systems, pp. 1315-1320 (2012) 

8. Levinson, J., Askeland, J., Dolson, J., Thrun, S.: Traffic light mapping, localization, and 
state detection for autonomous vehicles. In: Proceedings of International IEEE Conference 
on Robotics and Automation (ICRA), pp. 5784-5791 (2011) 

9. Haltakov, V., Mayr, J., Unger, C., Ilic, S.: Semantic segmentation based traffic light detection 
at day and at night. In: Proceedings of German Conference on Pattern Recognition, Lecture 
Notes in Computer Science, vol. 9358, pp. 446-457 (2015) 

10. Charette, R., Nashashibi, F.: Real time visual traffic lights recognition based on spot light 
detection and adaptive traffic lights templates. In: Proceedings of IEEE Intelligent Vehicles 
Symposium (2009) 

11. Fairfield, N., Urmson, C.: Traffic light mapping and detection. In: Proceedings of International 
IEEE Conference on Robotics and Automation (ICRA), pp. 5421-5426 (2011) 


12. John, V., Yoneda, K., Qi, B., Liu, Z., Mita, S.: Traffic light recognition in varying illumination 
using deep learning and saliency map. In: Proceedings of International IEEE Conference on 
Intelligent Transportation System (ITSC) (2014) 

13. Gradinescu, V., Gorgorin, C., Diaconescu, R., Cristea, V., Iftode, L.: Adaptive traffic lights using 
car-to-car communication. In: Proceedings of 65th IEEE Vehicular Technology Conference, 
pp. 21-25 (2007) 

14. Kumar, N., Lourenco, N., Terra, D., Alves, L.N., Aguiar, R.L.: Visible light communication 
in intelligent transportation systems. In: Proceedings of IEEE Intelligent Vehicle Symposium, 
pp. 748-753 (2012) 

15. Dresner, K., Stone, P.: A multiagent approach to autonomous intersection management. J. Artif. 
Intell. Res. 31, 591-656 (2008) 

16. High Dynamic Range. https://en.wikipedia.org/wiki/High-dynamic-range_imaging 

17. Jang, C., Kim, C., Kim, D., Lee, M., Sunwoo, M.: Multiple exposure images based traffic light 
recognition. In: Proceedings of IEEE Intelligent Vehicle Symposium, pp. 1313-1318 (2014) 

18. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings 
of International IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893 
(2005) 

19. Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning 
fine-grained image similarity with deep ranking. In: Proceedings of IEEE Conference on 
Computer Vision and Pattern Recognition, pp. 1386-1393 (2014) 

20. Wang, J.-G., Zhou, L.-B., Pan, Y., Lee, S., Han, B.-S., Billy, V.: Appearance based brake-lights 
recognition. In: Proceedings of IEEE Intelligent Vehicle Symposium (2016) 

21. Casares, M., Almagambetov, A., Velipasalar, S.: A robust algorithm for the detection of vehicle 
turn signals and brake lights. In: Proceedings of International IEEE Conference on Advanced 
Video and Signal-Based Surveillance, pp. 386-391 (2012) 

22. Wang, J.-G., Zhou, L.-B.: Traffic light recognition with high dynamic range imaging and deep 
learning. IEEE Trans. Intell. Transp. Syst. 20(4), 1341-1352 (2019) 

23. Wang, J.-G., Zhou, L.-B., Song, Z.-W., Yuan, M.-L.: Real-time vehicle signal lights recognition 
with HDR camera. In: Proceedings of IEEE International Conference on Internet of Things 
(Things) (2016) 

24. Saliency Map. https://en.wikipedia.org/wiki/Saliency_map 

25. Kim, H.-K., Park, J.H., June, H.-Y.: Effective traffic lights recognition method for real time 
driving assistance system in the daytime. Int. J. Electr. Comput. Eng. 5(11), 1429-1432 (2011) 

26. Bradski, D.: Dr. Dobb’s journal of software tools 

27. Zebra2 camera. https://www.ptgrey.com/zebra2-28-mp-color-gige-hd-sdi-sony-icx687-camera 

28. Lu, H., Zhang, H., Yang, S., Zheng, Z.: Camera parameters auto-adjusting technique for robust 
robot vision. In: Proceedings of International IEEE Conference on Robotics and Automation, 
pp. 1518-1523 (2010) 

29. Agarwal, V., Abidi, B.R., Koschan, A., Abidi, M.A.: An overview of color constancy 
algorithms. J. Pattern Recogn. Res. 1(1), 42-54 (2006) 

30. Shim, I., Lee, J.-Y., Kweon, I.S.: Auto-adjusting camera exposure for outdoor robotics using 
gradient information. In: Proceedings of IEEE/RSJ International Conference on Intelligent 
Robotics and Systems, pp. 1011-1017 (2014) 

31. Deep learning. Wiki. https://en.wikipedia.org/wiki/Deep_learning 

32. Hu, Y., Xie, X., Ma, W.-Y., Chia, L.-T., Rajan, D.: Salient region detection using weighted 
feature maps based on the human visual attention model. In: Proceedings of Pacific Rim 
Conference on Multimedia (2004) 

33. Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. 
In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009) 

34. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional 
neural networks. In: Proceedings of NIPS (2012) 

35. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, 
T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of ACM 
International Conference on Multimedia, pp. 675-678 (2014) 




36. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with 
region proposal networks. https://arxiv.org/abs/1506.01497 

37. YOLO: real-time object detection. https://pjreddie.com/darknet/yolov1/ 

38. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. https://arxiv.org/abs/1612.08242 

39. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single 
shot multibox detector. https://arxiv.org/abs/1512.02325 

40. I2R AV. https://www.a-star.edu.sg/i2r/RESEARCH/AUTONOMOUS-SYSTEMS 

41. Data Distribution Service. https://en.wikipedia.org/wiki/Data_Distribution_Service 

42. Long short-term memory. https://en.wikipedia.org/wiki/Long_short-term_memory 

43. Koller, D., Weber, J., Malik, J.: Robust multiple car tracking with occlusion reasoning. Springer, 
Berlin (1994) 

44. She, K., Bebis, G., Gu, H., Miller, R.: Vehicle tracking using online fusion of color and shape 
features. In: Proceedings of 7th International IEEE Conference on Intelligent Transportation 
Systems, pp. 731-736 (2004) 

45. Chan, Y.-M., Huang, S.-S., Fu, L.-C., Hsiao, P.-Y.: Vehicle detection under various lighting 
conditions by incorporating particle filter. In: Proceedings of IEEE Intelligent Transportation 
Systems Conference, pp. 534-539 (2007) 

46. O’Malley, R., Jones, E., Glavin, M.: Rear-lamp vehicle detection and tracking in low-exposure 
color video for night conditions. IEEE Trans. Intell. Transp. Syst. 11(2), 453-462 (2010) 

47. Casares, M., Almagambetov, A., Velipasalar, S.: A robust algorithm for the detection of vehicle 
turn signals and brake lights. In: Proceedings of IEEE Ninth International Conference on 
Advanced Video and Signal-Based Surveillance, pp. 386-391 (2012) 

48. Almagambetov, A., Casares, M., Velipasalar, S.: Autonomous tracking of vehicle rear lights and 
detection of brakes and turn signals. In: Proceedings of IEEE Symposium on Computational 
Intelligence for Security and Defence Applications (CISDA), pp. 1-7 (2012) 

49. Cui, Z.-Y., Yang, S.-W., Tsai, H.-M.: A vision-based hierarchical framework for autonomous 
front-vehicle taillights detection and signal recognition. In: Proceedings of IEEE 18th Interna- 
tional Conference on Intelligent Transportation Systems, pp. 931-937 (2015) 

50. Thammakaroon, P., Tangamchit, P.: Predictive brake warning at night using taillight 
characteristic. In: Proceedings of IEEE International Symposium on Industrial Electronics, 
pp. 217-221 (2009) 

51. Ming, Q., Jo, K.-H.: Vehicle detection using tail light segmentation. In: Proceedings of 6th 
IEEE International Forum on Strategic Technology (IFOST), vol. 2, pp. 729-732 (2011) 

52. Nagumo, S., Hasegawa, H., Okamoto, N.: Extraction of forward vehicles by front-mounted 
camera using brightness information. In: Proceedings of IEEE Canadian Conference on Elec- 
trical and Computer Engineering, vol. 2, pp. 1243-1246 (2003) 

53. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Cascade object detection with deformable 
part models. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 
pp. 2241-2248 (2010) 

54. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog- 
nition. arXiv:1409.1556 

55. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., 
Khosla, A., Bernstein, M.: Imagenet large scale visual recognition challenge. arXiv:1409.0575 
(2014) 


The Application of Deep Learning 
in Marine Sciences 


Miguel Martin-Abadal, Ana Ruiz-Frau, Hilmar Hinz 
and Yolanda Gonzalez-Cid 


Abstract Ecological studies are increasingly using video image data to study the 
distribution and behaviour of organisms. Particularly in marine sciences cameras 
are utilised to access underwater environments. Up till now image data has been 
processed by human observers which is costly and often represents repetitive mun- 
dane work. Deep learning techniques that can automatically classify objects can 
increase the speed and the amounts of data that can be processed. This ultimately 
will make image processing in ecological studies more cost effective, allowing stud- 
ies to invest in larger, more robust sampling designs. As such, deep learning will be 
a game changer for ecological research helping to improve the quality and quantity 
of the data that can be collected. Within this chapter we introduce two case stud- 
ies to demonstrate the application of deep learning techniques in marine ecological 
studies. The first example demonstrates the use of deep learning in the detection and 
classification of an important underwater ecosystem in the Mediterranean (Posido- 
nia oceanica seagrass meadows), the other showcases the automatic identification 
of several jellyfish species in coastal areas. Both applications showed high levels of 
accuracy in the detection and identification of the study organisms, which represents 
encouraging results for the applicability of these methodologies in marine ecological 
studies. Despite its potential, deep learning has not yet been widely adopted 
in ecological studies. Information technologists and natural scientists alike need to 
collaborate more actively to move forward in this field of science. Cost-effective data 
collection solutions are desperately needed in a time when large amounts of data are 
required to detect and adapt to global environmental change. 


Keywords Deep learning · Application · Marine · Posidonia oceanica · Jellyfish · 
Semantic segmentation · Object detection 


1 Introduction 


Traditional data collection in ecological studies generally relies on human visual 
observations detecting the occurrences of organisms and relating those to environmental 
or anthropogenic factors. Similarly, visual observations are crucial in describing 
behavioural interactions amongst individuals of the same species and other organisms. 
Making such ecological observations is often time-consuming, labour intensive 
and hence costly [1-3]. 

The high cost of undertaking ecological observations often restricts the 
amount of data that can be collected, therefore limiting the robustness of studies and 
the conclusions that can be drawn from the results. With the advent of relatively cheap 
video recording techniques, visual observations can now be made simultaneously at 
multiple sites, covering larger spatial and temporal scales and reaching environments 
where previously no human observations could be made. This is of particular relevance 
for ecological studies investigating organisms inhabiting underwater marine 
ecosystems. Here, human observations are limited by constraints of depth and 
time. Observations by divers are generally limited to depths of 
approximately 30 m and may last only a couple of hours (depending on depth), while 
greater depths can only safely be reached with increasingly complex technologies 
[4-6]. Video cameras, in contrast, can easily be deployed to almost any depth 
and from any type of platform (e.g. [7-11]). Video observations have thus dramatically 
increased the potential for data collection in marine sciences. 

Nevertheless, these advances have not yet led to a reduction in the cost of ecolog- 
ical studies using images as a data source. While more data can now be collected, its 
subsequent interpretation and analysis is often still done by humans. This process is 
highly repetitive and often takes the same amount of time or longer than the recording 
of the original images thus keeping costs elevated [12, 13]. 

Computer aided automatic classification of images using deep learning can significantly 
increase the speed, and thus reduce the cost, of image data interpretation and analysis. 
While expert knowledge is still needed to train and quality check the computer aided 
image interpretation, the automation allows for the processing of larger data-sets 
within a fraction of the time a human observer would require. Additionally, the 
computer aided interpretations often have a higher precision compared to humans 
[14]. 

While in recent years there has been an increasing interest in the use of automatic 
image classification in ecology, there are currently still few scientists adopting these 
technological advances, probably due to the interdisciplinary know-how boundary 
between ecology and new information technologies (but see e.g. [15-19]). This lack 
of uptake may however be overcome in the future as end-user interfaces for 
this technology become more user friendly. 

Automated image classification and segmentation through deep learning 
techniques represent a game changer for ecological studies using image-based data 
collection. They open the possibility of taking data collection and processing to a 
completely new level, with the potential of delivering more robust and statistically 
sound data at a highly reduced cost. The reduced cost may also provide the solution 
for the maintenance or establishment of highly important long-term data collection 
against the backdrop of anthropogenic change [20]. The collection of long-term data 
at the appropriate spatial and temporal scale has thus far lacked commitment by 
governments and scientists alike, due to its high cost and initially low scientific 
returns, respectively [21]. 

In this chapter we present two case studies that use deep learning to automati- 
cally process underwater images with the aim of showcasing the potential of these 
methodologies in improving ecological data collection and processing. The studies 
presented represent promising solutions for the data collection and processing of two 
marine organisms highly relevant for society. 

The first case study demonstrates the identification of seagrass meadows, Posi- 
donia oceanica, from video sequences recorded from an Autonomous Underwater 
Vehicle (AUV) using semantic segmentation. Seagrass meadows provide a wide 
range of benefits for society such as the attenuation of wave energy thus contribut- 
ing to the maintenance of sandy beaches as well as providing a habitat for many 
commercial and non-commercial species. With the help of deep learning, larger and 
more precise habitat maps can be produced for long term monitoring, vital for the 
management and protection of this habitat. 

The second case study shows how different species of jellyfish, some of them with 
negative impacts for society, can be identified and classified using object detection 
deep learning algorithms. This detection and assessment of jellyfish has relevance 
with respect to increasing our understanding of jellyfish ecology and also provides the 
potential for coastal monitoring systems to mitigate impacts of jellyfish on humans. 


2 Methodology 


Deep learning enables computational models composed of multiple processing layers 
to learn representations of data with different levels of abstraction. Deep Learning is 
one of the sub-fields of Machine Learning and has been advancing at an impressive 
pace over the last couple of years, bringing excellent results in different disciplines. 
In particular, Convolutional Neural Networks (CNN or ConvNet) [22] are achieving 
important milestones in image, video and audio processing and have been widely 
adopted by the computer vision community. A CNN is a particular kind of deep 
neural network consisting of an input layer, an output layer and multiple hidden 
layers. The hidden layers of a CNN consist of diverse convolutional layers, ReLU 
activation layers, pooling layers, fully connected layers and normalisation layers. 

The wide range of algorithms and applications of CNN in computer vision can 
be classified into four main types: 


• Classification. Given a raw image, the task is to identify the class to which the 
image belongs. 

• Classification and Localisation. Given a raw image with only one object in it, the 
task is to find the location of the object within the image. 

• Object Detection. The task is to identify the location of several objects within an 
image. Objects might be of the same class or different classes altogether. 

• Image Segmentation. Each pixel composing an image is classified and assigned to 
a particular class. Image segmentation is also known as semantic segmentation. 


The main methodology and general requirements for implementing a 
CNN for image processing vary depending on whether it is used for classification, 
object detection, or segmentation. 

Due to the vast resources required to train deep learning architectures or the large 
and challenging data-sets on which deep learning models should be trained, it is 
very common to use transfer learning instead of designing a model from scratch. 
In transfer learning, a model trained in order to perform one task is re-trained to 
accomplish a second related task, which allows an improved performance when 
modelling the second task. 

Therefore, the first step using transfer learning is to select a pre-trained source 
model from already available models that best fits the application needs. If the dataset 
in your problem domain is similar to ImageNet dataset [23], a pre-trained model on 
this dataset can be used. The most widely used pre-trained models are VGG net [24] 
with 19 or 16 layers, ResNet [25] with 152, 101, 50 layers or less, DenseNet [26] 
with 201, 169 and 121 layers, Inception [27] or Xception [28]. 

The next step is to organise the data needed to train the selected model. Data should 
be divided into two subsets, the training set and the testing set. A ground truth (GT) 
for both subsets should also be generated. GT images are those labelled by experts 
using direct observation, from which the network will learn during the training. 
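
The division into training and testing subsets can be sketched as follows. This is a minimal illustration only: the file names, the 80/20 ratio and the helper `split_dataset` are assumptions made for the example and are not taken from the case studies in this chapter.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle (image, ground_truth) pairs and split them into train/test subsets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names: each image is paired with its expert-labelled GT.
samples = [(f"img_{i:03d}.png", f"gt_{i:03d}.png") for i in range(10)]
train_set, test_set = split_dataset(samples)
print(len(train_set), len(test_set))
```

Keeping each image together with its GT in one tuple ensures that the two are never separated by the shuffle.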

Training deep neural networks is difficult. It requires knowledge and expertise to 
train properly and obtain an optimal model. Different model training algorithms 
may require different hyperparameter tuning. The hyperparameters determine 
some of the network’s features or its training process, and are fixed before the 
training process begins. 

In general, the values of the hyperparameters are chosen by training the network 
several times with different values and deciding which ones work best by evaluating 
the results. 

Moreover, several methods can be used during training to make the validation more 
reliable. One of the most commonly used is cross validation. 
For each combination of hyperparameters the model is trained using the K-fold 
Cross Validation method [29]: the data are split into X subsets of the same size and 
the network is trained X times, each time using one subset to test the network 
and the remaining X - 1 subsets to train it. This method reduces the variability of 
the results and gives a more accurate performance estimate in the validation 
process. 
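
The splitting scheme can be sketched in a few lines. The function name `k_fold_splits` is introduced here for illustration, and the folds are taken in index order; in practice the samples would normally be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_splits(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Every sample appears in exactly one test subset, so each of the X trained models is evaluated on data it has not seen during training.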

From each cross-validation training applied over a set H of hyperparameters, 
X models are generated, $M_H^i$, where H = 1, 2, ..., h represents the hyperparameter 
set number and i = 1, 2, ..., X the model index. Subsequently, the X models are 
executed over their corresponding test subsets, obtaining the predictions $P_H^i$. From 
these predictions, each model is evaluated, assessing its performance $R_H^i$. Note that 
depending on the model trained and its output (classification, localisation, object 
detection or image segmentation) the metrics to be used and the evaluation process 
might be different. Finally, the performance $R_H$ of each set H of hyperparameters 
is obtained by computing the mean of the performances $R_H^i$ of its X models. 

The workflow for assessing the performance of each hyperparameter set is repre- 
sented in Fig. 1. 
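
In other words, $R_H = \frac{1}{X}\sum_{i=1}^{X} R_H^i$. The sketch below shows one way this workflow could be organised. The scoring function `toy_train_and_score` is a deterministic placeholder invented for the example; it stands in for training a model, running it on the test subset and computing a metric, and the hyperparameter names and values are likewise assumptions.

```python
from statistics import mean


def evaluate_hyperparameter_sets(hyper_sets, folds, train_and_score):
    """Return {H: R_H}, where R_H is the mean of the per-fold performances R_H^i."""
    results = {}
    for h, params in hyper_sets.items():
        fold_scores = [train_and_score(params, train_idx, test_idx)
                       for train_idx, test_idx in folds]
        results[h] = mean(fold_scores)
    return results


def toy_train_and_score(params, train_idx, test_idx):
    # Placeholder only: a real implementation would train a model on train_idx,
    # predict on test_idx and return a metric such as accuracy.
    return 1.0 - 10 * abs(params["learning_rate"] - 0.01) - 0.001 * test_idx[0]


# Hypothetical hyperparameter sets H = 1, 2 and a two-fold split of four samples.
hyper_sets = {1: {"learning_rate": 0.1}, 2: {"learning_rate": 0.01}}
folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]

performance = evaluate_hyperparameter_sets(hyper_sets, folds, toy_train_and_score)
best_h = max(performance, key=performance.get)
print(performance, best_h)
```

The hyperparameter set with the highest mean performance is then selected; with the toy scores above this is set 2.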

The next sections show the training and validation process of two different deep 
ConvNets for automatic Posidonia oceanica segmentation and for jellyfish detection 
and classification, respectively. 


3 Seagrass Segmentation 


Posidonia oceanica is an endemic Mediterranean seagrass species that forms dense 
and extensive meadows that grow down to a depth of 45 m. From a social-ecological 
perspective, this ecosystem is of utmost importance, since it plays a crucial role in 
the maintenance of coastal processes and functions and provides a range of benefits 
and services to society [30, 31]. Recent studies have evidenced a global decline of P. 
oceanica [32, 33]. For these reasons, the European Commission 
directive 92/43/CEE identifies P. oceanica as a priority natural habitat. 

The management and restoration strategies for P. oceanica rely heavily on 
monitoring and mapping the coverage and state of the meadows. 
These aspects are fundamental in the assessment of P. oceanica conservation status, 
making it possible to detect decline trends early or to assess how effective an applied 
protection and recovery measure is. 

Currently, the monitoring tasks are mostly carried out by divers, who manually 
measure meadow parameters such as lower limit depth, shoot density or 
extension [34]. However, the collection of these data is slow and costly. 

Other approaches to monitoring P. oceanica make use of multi-spectral satellite 
images [35], acoustic bathymetries [36] or Autonomous Underwater Vehicles (AUV) 
equipped with sensors to obtain different parameters from P. oceanica meadows [37, 
38]. These techniques suffer from some disadvantages, among them poor 
effectiveness in deep areas, the inability to distinguish between P. oceanica and 
other algae types, and the fact that they cannot perform the detection autonomously. 
In [39] autonomous detection was achieved by combining traditional image 
descriptors with Machine Learning (ML) and the use of Support Vector Machines 
(SVM). Additionally, in [40] the idea of using Convolutional Neural Networks (CNN) 
for P. oceanica detection is explored. These approaches also have a drawback: 
in both of them the classification is not made at pixel level. Instead, the 
images are divided into smaller patches, and each patch is classified as P. oceanica or 
as background. This causes loss of information and lower prediction resolution, since 
all pixels of a patch are assigned the class of the patch they belong to. 
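
The loss of resolution can be illustrated with a toy example. The 4x4 label map, the 2x2 patch size and the majority-vote patch labels below are assumptions chosen for the illustration; they do not reproduce the classifiers of [39] or [40].

```python
def patch_to_pixel_labels(patch_labels, patch_size):
    """Expand a grid of per-patch labels so every pixel inherits its patch's label."""
    pixel_rows = []
    for patch_row in patch_labels:
        expanded = [label for label in patch_row for _ in range(patch_size)]
        pixel_rows.extend(list(expanded) for _ in range(patch_size))
    return pixel_rows


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    total = sum(len(row) for row in truth)
    correct = sum(p == t
                  for pred_row, truth_row in zip(pred, truth)
                  for p, t in zip(pred_row, truth_row))
    return correct / total


# Toy ground truth: 1 = P. oceanica, 0 = background. The meadow boundary cuts
# diagonally through two of the four 2x2 patches.
truth = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Even a patch classifier that always returns the majority class of its patch...
patch_labels = [[1, 0],
                [0, 0]]
# ...mislabels the minority pixels inside the patches crossed by the boundary.
patch_prediction = patch_to_pixel_labels(patch_labels, patch_size=2)
print(pixel_accuracy(patch_prediction, truth))  # 0.875
```

A per-pixel classifier has no such built-in ceiling, which is the motivation for the semantic segmentation approach described next.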

The application of deep learning techniques allows for the use of neural network 
architectures with more hidden layers that, together with semantic segmentation, 
can perform a per-pixel classification instead of a patch-based one. This avoids 
the information loss, yields a classification at full image resolution and improves 
the accuracy of the classification task. 

The main goal in this case study was to perform an automatic segmentation of P. 
oceanica meadows in sea-floor images. 

The following sections describe the deep neural network used and its main 
characteristics, present the different study cases and hyperparameter combinations, 
the data acquisition and processing, the validation and evaluation processes, and 
finally the classification results. 


3.1 Deep Learning Approach 


To determine the areas where P. oceanica was present, a semantic segmentation 
architecture was used. Below, we describe the architecture of the network 
and its training details. 


Network Architecture 


The architecture used is the so-called VGG16-FCN8, which is a fully convolutional 
network, meaning that it can make dense pixel-wise predictions for image tasks like 
semantic segmentation. These architectures are divided into two blocks, the encoder 
and the decoder. 

The encoder extracts spatial features from the input images making use of a series 
of convolutional layers. These layers apply a convolution by sweeping a kernel over 
the input and passing the result to the following layer. This process is carried out 
X times over the same input, each time using a different kernel, generating X feature 
maps. Encoders also implement max pooling layers, which are used to reduce the 
dimensions of the feature maps, offering better computational performance as the 
number of parameters is reduced. 
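
The two operations can be sketched in plain Python. The image, the two kernels and the 2x2 pooling window are assumptions chosen to keep the example small; as in most CNN libraries, the "convolution" below slides the kernel without flipping it (strictly speaking, a cross-correlation).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


# 6x6 toy image with a vertical edge between the third and fourth columns.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
kernels = [
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # responds to vertical edges
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # responds to horizontal edges
]
# One feature map per kernel: 6x6 -> 4x4 after convolution -> 2x2 after pooling.
feature_maps = [max_pool2d(conv2d(image, k)) for k in kernels]
print(feature_maps)
```

The first kernel produces a strong response and the second none, which shows how different kernels applied to the same input yield different feature maps.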

The selected architecture makes use of the VGG16 encoder [24], removing the last
classification layer and converting the last two fully connected layers into
convolutional layers. It contains six different sections. Each of the first five is
built from two or three convolutional layers and a max pooling layer. The last one
contains two convolutional and two dropout layers, interleaved. This structure makes
it possible to extract low-level information from the image in the first sections;
then, as more convolutional and max pooling layers are applied, the feature maps
shrink down to 1/32 of the original image size, incorporating more complex high-level
information. Finally, the convolutional layers of the last section carry the spatial
information into the decoder and generate a low resolution segmentation, while the
dropout layers help to reduce overfitting.
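The 1/32 reduction mentioned above follows from the five max pooling layers, assuming each one halves the height and width (stride 2, as in the standard VGG16). The short sketch below only illustrates this arithmetic; the 480 x 352 input size is a hypothetical example, not a value taken from this chapter.

```python
def feature_map_sizes(height, width, num_pool_layers=5, pool_stride=2):
    """Return the spatial size of the feature maps after each max pooling layer.

    Assumes every pooling layer divides height and width by `pool_stride`.
    """
    sizes = [(height, width)]
    for _ in range(num_pool_layers):
        height, width = height // pool_stride, width // pool_stride
        sizes.append((height, width))
    return sizes


# Hypothetical input size, chosen to be divisible by 32.
sizes = feature_map_sizes(480, 352)
for level, (h, w) in enumerate(sizes):
    print(f"after {level} pooling layers: {h} x {w}")
```

With five halvings, the final feature maps are 2^5 = 32 times smaller than the input along each dimension.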

The purpose of the decoder is to take the low resolution segmentation output of the
encoder and up-sample it to the original image size, obtaining a high-resolution
segmentation. In order to accomplish this task, a series of transposed convolutional
layers are used. These layers apply an inverse convolution over the input,
up-sampling each pixel to the convolutional kernel size. The decoder also contains
skip layers [41], which are used to combine the encoder's low-level features with
the higher-level, coarse information from the transposed convolutional layers.
Lastly, an activation layer obtains the final semantic segmentation.

The selected architecture makes use of the FCN8 decoder [42], which contains
three of the aforementioned transposed convolutional layers and three skip layers,
interleaved. By adjusting the kernel sizes and strides of the transposed convolutional
layers, the shrunken feature maps are up-sampled to the original image size. Lastly,
a softmax activation layer obtains the final probabilistic segmentation map. The
architecture described is presented in Fig. 2.

This architecture has already been used for other segmentation tasks, such as road
segmentation for autonomous driving in [43] or class segmentation of the PASCAL
VOC 2011-2 dataset in [42], presenting great results in both cases.


Training Details 


In order to train the VGG16-FCN8 architecture, both the encoder and the decoder must
be trained. Their training is conducted by readjusting the kernel values in the
convolutional layers and transposed convolutional layers, respectively. This
architecture allows both the encoder and the decoder to be trained with the same
backpropagation functions, in a single forward and backward pass for each iteration.

The training process makes use of images containing P. oceanica, and their cor- 
responding label maps, where each class is marked in a different colour. 

To train the network, a loss function is needed to drive the backpropagation,
indicating the direction and magnitude of change. In this case, a cross-entropy loss
function is used [44]; its value increases as the predicted probability diverges from
the ground truth label. Also, the Adam optimization algorithm is implemented in order
to help the training converge towards a minimum of the error [45]. Finally, in order
to help prevent overfitting [46], two dropout layers are interleaved between the
fully connected layers of the encoder.
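As an illustration of how the cross-entropy loss behaves, the following minimal NumPy sketch computes a pixel-wise cross-entropy between a softmax output and an integer label map. It is not the training code used in this work; the function name, the array layout (height, width, classes) and the clipping constant `eps` are assumptions made for the example.

```python
import numpy as np


def pixelwise_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over all pixels.

    probs:  (H, W, C) array of per-pixel class probabilities (softmax output).
    labels: (H, W) array of integer class ids (ground truth label map).
    """
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    # Probability that the network assigned to the true class of each pixel.
    p_true = probs[rows, cols, labels]
    # Clip to avoid log(0); eps is an illustrative choice.
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))


labels = np.array([[0, 1]])                       # background, P. oceanica
good = np.array([[[0.9, 0.1], [0.1, 0.9]]])       # confident and correct
bad = np.array([[[0.1, 0.9], [0.9, 0.1]]])        # confident and wrong
good_loss = pixelwise_cross_entropy(good, labels)
bad_loss = pixelwise_cross_entropy(bad, labels)
print(good_loss, bad_loss)
```

The loss is small when the predicted probability of the true class is close to 1 and grows as the prediction diverges from the label.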

In order to benefit from the advantages of transfer learning, the encoder layers are
initialized with the pretrained weights of a VGG network trained on ImageNet [47].
The initialization of the transposed convolution layers of the decoder is carried out
using bilinear upsampling. Finally, a truncated Gaussian initialization is applied to
the skip connections. These initialization parameters have already presented great
results for this network in [43].

The trainings were performed on a computer equipped with an Intel Core i7-7700
processor, a GeForce GTX 1080 graphics card and 16 GB of RAM.


3.2 Experimental Framework 


This section describes the experimental framework followed in this application.
First, the image acquisition and labelling processes are described, along with
the dataset usage. Next, the different hyperparameter combinations studied are
presented. Finally, we describe the validation and evaluation processes.


Datasets 
Acquisition 


The images used to train and test the architecture were extracted from video sequences 
recorded using cameras mounted on an AUV facing downwards. 

An AUV was navigated over P. oceanica beds located on the West and North-West
of Mallorca (Fig. 3), obtaining images under different P. oceanica conditions
such as health state, meadow density and coloration; or water depth, illumination
and turbidity.

A sample of the gathered images can be seen in Fig. 4. 


Labelling 


From the obtained images, label maps were manually built. The areas with P. oceanica
were marked in white, and the background areas in black. These label maps serve as
ground truth for the gathered images, and are used to train and test the network.
Figure 5 shows an image with its corresponding label map.


Dataset Arrangement 


In order to build our datasets, six different AUV missions were performed,
obtaining up to 483 images. These images were representative of the different
environmental and meadow conditions encountered during the sampling process.

From the gathered images, we generated two datasets, namely the mix dataset,
including 460 images, and the extra dataset, containing 23 images. Table 1 indicates
the location, month of acquisition, camera used, number of images and the
corresponding dataset of each mission.

The mix dataset (460 images) was used to train (80% of the images) and test (20% of
the images) the network. It offered a wide range of P. oceanica and environmental
conditions, contributing to the robustness of the network training.

The extra dataset (23 images) was recorded using a different camera from the mix
dataset. The extra dataset was used as an additional test. It helps to detect training
overfitting, providing information on how well the network generalises its training
to images containing distinct unseen conditions (camera used and P. oceanica or
environmental conditions).

Fig. 3 Map of the study area showing the island of Mallorca in the Western
Mediterranean. Sampling points are indicated with arrows


Table 1 Dataset arrangement

Fig. 5 a Original image. b Label map

In order to find the hyperparameters that offer the best performance, the network was
trained with the different values and combinations shown in Table 2.

First, the network was trained with and without implementing data augmentation.
This technique consists in applying contrast, brightness, color and morphological
transformations to the training images in order to train over more diverse data,
helping to reduce overfitting [48]. Secondly, two different learning rates were set,
modifying the training step size when minimizing the loss [49]. Finally, two numbers
of iterations were used, setting the number of times the network backpropagates and
trains [49].


Experiments 


Following the methodology explained in Sect. 2, eight different experiments were
conducted, K = 1, 2, ..., 8, each one assessing the performance of a hyperparameter
combination, using its corresponding hyperparameters and applying a 5-fold
cross-validation, i = 1, 2, ..., 5. On each cross-validation fold, 4 subsets of the
mix dataset (80% of the data) were used to train the network and the remaining one
(20% of the data) was used to test it. Also, the entire extra dataset was used to
test the network. This process is described in Fig. 6.
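The 5-fold split described above can be sketched as follows. This is a generic illustration, not the code used for the experiments: the function name, the shuffling step and the fixed seed are assumptions, and images are represented only by their indices.

```python
import random


def k_fold_splits(num_items, k=5, seed=0):
    """Split item indices into k folds and return (train, test) index lists.

    Each fold is used once as the test set; the other k - 1 folds form the
    training set. The shuffle and seed are illustrative assumptions.
    """
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 460 images in the mix dataset: 368 for training and 92 for testing per fold.
splits = k_fold_splits(460, k=5)
for i, (train, test) in enumerate(splits, start=1):
    print(f"model i={i}: {len(train)} training images, {len(test)} test images")
```

Every image appears in exactly one test fold, so each model is always evaluated on images it has not seen during training.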

The evaluation process of each model starts by binarizing its probabilistic outputs.
We decided to perform this binarization at nine equally distributed threshold values,
j = 1, 2, ..., 9 (Fig. 7).

Then, we performed a comparison between each binarized output and its corresponding
label map, acting as ground truth.

From this comparison, we generated a confusion matrix, which indicates the number
of pixels classified as P. oceanica correctly (True Positives, TP) and wrongly (False
Positives, FP), and also the number of pixels classified as background correctly (True
Negatives, TN) and wrongly (False Negatives, FN). From these values, the accuracy,
precision, recall and fall-out of the model are computed.
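The binarization and metric computation described above can be sketched with NumPy as shown below. The function names are hypothetical, and the nine thresholds are assumed to be 0.1, 0.2, ..., 0.9, since the text only states that they are equally distributed. The metric definitions are the standard ones for a binary confusion matrix.

```python
import numpy as np


def binarize(prob_map, threshold):
    """Mark as P. oceanica every pixel whose probability reaches the threshold."""
    return prob_map >= threshold


def confusion_counts(pred, truth):
    """Return (TP, FP, TN, FN) pixel counts for a binary prediction."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    tn = int(np.sum(~pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard accuracy, precision, recall and fall-out definitions."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fall_out": fp / (fp + tn) if fp + tn else 0.0,
    }


# Assumed threshold values (nine equally spaced values between 0.1 and 0.9).
thresholds = np.linspace(0.1, 0.9, 9)

# Toy 2 x 2 example: network output and its ground truth label map.
prob_map = np.array([[0.9, 0.8], [0.4, 0.2]])
truth = np.array([[1, 0], [1, 0]])
counts = confusion_counts(binarize(prob_map, 0.5), truth)
result = metrics(*counts)
print(counts, result)
```

In the experiments this computation would be repeated for each of the nine thresholds, yielding one confusion matrix per threshold.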

Finally, a Receiver Operating Characteristic (ROC) curve is generated [50],
representing the recall against the fall-out values of the classifier at various
thresholds. The analysis of the Area Under the Curve (AUC) of the ROC curve offers a
measure of the classifier performance.

Figure 8 represents the process followed to evaluate a model.

Fig. 7 Probabilistic network output of an image (a) and one of its corresponding
binarizations (b)

Fig. 8 Evaluation process for the model “i” of experiment “K”. The network prediction
is binarized at j = 1, 2, ..., 9 threshold values, generating a confusion matrix for
each one. Finally, the evaluation metrics are calculated


3.3 Classification Results 


This section presents the obtained results for each experiment along with the hyper- 
parameter selection process. 

In this section we use a three-field notation to refer to each experiment (for
example, 1_1_16), indicating its hyperparameters. The first field indicates whether
data augmentation was used (1) or not (0). The second one indicates whether the
learning rate used is 1e-05 (1) or 5e-04 (5). The last field indicates whether the
number of iterations is 8000 (8) or 16,000 (16).

Experiment Performance
Mix Dataset Results


Figure 9 shows the results of evaluating the mix test set. In Fig. 9a, the ROC curve
is presented along with its AUC value for each experiment, while the precision
and accuracy values at the optimal binarization threshold are represented in Fig. 9b
as bar charts. The optimal binarization threshold is selected as the one presenting
the highest trade-off between recall and fall-out, calculated as:


$$\text{Trade-off} = \frac{\text{Recall} + (1 - \text{Fall-out})}{2} \tag{1}$$
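The threshold selection can be sketched as follows, assuming that the trade-off is the mean of the recall and (1 - fall-out), so that both terms weigh equally. The function names and the example values are hypothetical and are not results from the experiments.

```python
def trade_off(recall, fall_out):
    """Mean of recall and (1 - fall-out); the divisor of 2 is an assumption."""
    return (recall + (1.0 - fall_out)) / 2.0


def optimal_threshold(thresholds, recalls, fall_outs):
    """Return the threshold with the highest trade-off, and that trade-off."""
    scores = [trade_off(r, f) for r, f in zip(recalls, fall_outs)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return thresholds[best], scores[best]


# Hypothetical metrics measured at three binarization thresholds.
best_thr, best_score = optimal_threshold(
    thresholds=[0.3, 0.5, 0.7],
    recalls=[0.95, 0.90, 0.70],
    fall_outs=[0.30, 0.10, 0.05],
)
print(best_thr, best_score)
```

A low threshold favours recall at the cost of a higher fall-out, and a high threshold does the opposite; the selected threshold balances both.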


The ROC curves for all experiments showed AUC values over 95%, reaching a 
maximum of 98.7% for the 1_1_16 experiment. According to the criteria established 
in [51] to determine how good a classifier is based on its AUC value, these values 
represent excellent classifiers. 

Precision and accuracy values were greater than 90% for all the experiments.
The maximum precision achieved was 97.5%, for the experiment 1_1_8, while the
lowest one was 92.2%. For the accuracy, the maximum achieved was 96.5%, for the
experiment 1_1_16.

The comparison of the different experiments on a hyperparameter basis showed 
that: 


• Experiments with lower learning rates presented better precision, accuracy and
AUC values than experiments with higher rates.

• The effect of the number of iterations is almost negligible, with the metrics being
slightly better when trained over 16,000 iterations.

• The application of data augmentation had a slight effect similar to that of the
number of iterations, presenting a small benefit when it was applied.


These almost negligible effects may be due to specific conditions of our application,
such as the network already being trained after the 8k iterations, and the train set
already being diverse on its own, respectively.

Figure 10 shows qualitative results over images of the mix test set. 


Extra Dataset Results 


The results obtained on the mix dataset were promising but, as mentioned in Sect. 3.2,
the test images were extracted from the same immersions used to train the network,
containing similar environmental conditions. To assess the performance of each model
on unseen conditions, we evaluated them over the extra dataset. The results are
presented in Fig. 11.

The AUC values of experiments that used a learning rate of 5e-04 were lower
than the ones achieved in the mix test set evaluation, reaching values around
92%. On the other hand, experiments that used a learning rate of 1e-05 were able to
maintain the good results obtained on the mix dataset, achieving AUC values around
97.7% when the network was trained for 16,000 iterations and 97.0% when it was
trained for 8000.

Fig. 9 Mix test set results. a ROC curves and corresponding AUC. b Precision and
accuracy metrics obtained at the optimal binarization threshold

Fig. 10 Qualitative results obtained for images from the mix test set. On the first
row, two original images are shown. The second row of images illustrates the original
images with their corresponding ground truth superimposed in red. Finally, the last
set of images shows the results of the segmentation superimposed in green on the
original images


These results show that the models do not overfit the training images, being able to
generalize their training to images taken with a different camera, containing
different unseen environmental and P. oceanica conditions.

The same trend can be seen for the precision and accuracy: experiments with higher
learning rates achieved values of around 85% for both metrics, while experiments that
used lower learning rates achieved values of around 96% and 95%, respectively. It can
also be seen that experiments where the number of iterations was set to 16,000
presented slightly higher metrics.

Figure 12 shows qualitative results over test images of the extra dataset.

Fig. 11 Extra test set results. a ROC curves and corresponding AUC. b Precision and
accuracy metrics obtained at the optimal binarization threshold

Fig. 12 Qualitative results obtained for images from the extra dataset. On the first
row, two original images are shown. The second row of images illustrates the original
images with their corresponding ground truth superimposed in red. Finally, the last
set of images shows the results of the segmentation superimposed in green on the
original images


Hyperparameters Evaluation 


We conducted an overall comparison on a hyperparameter basis from the evaluation
results of all experiments, finding the hyperparameters which offer the best
performance.

The results clearly indicate that experiments that used lower learning rates
obtained better AUC, precision and accuracy results. The best learning rate was
identified as 1e-05. Also, it can be seen that experiments conducted using a larger
number of iterations tend to have a slightly better performance. The best number of
iterations was identified as 16,000. Finally, we decided to apply data augmentation,
helping to generalize the training to new unseen conditions for future immersions.


4 Jellyfish Detection and Identification 


Over the past decades, the social and scientific concern about increasing jellyfish
numbers has risen. This can be noticed in the number of reports on jellyfish: over the
past two decades, the number of media and news reports has increased dramatically,
by over 500% [52], often with alarmist headlines [53].

Parallel to this, there is an ongoing scientific debate on whether jellyfish numbers
are on the rise. On the one hand, some scientists argue that populations are increasing
due to a range of natural and man-made causes [54, 55], while on the other, some
scientists maintain that jellyfish populations have remained constant over time [53].
The lack of baseline data to endorse conclusions makes it difficult to support either
argument.

Regardless of the outcome of the debate, coastal populations are increasing, with
40% of the global population living within 100 km of the coast [56] and many more
spending their holidays and free time in coastal areas. The increase in the use of the 
coast and its associated resources and benefits is leading to a higher rate of encounters 
between humans and jellyfish with all the associated socioeconomic consequences 
[57]. Among others, jellyfish aggregations are known to negatively affect coastal 
tourism with associated impacts on tourism revenues and the tourism industry [58]. 
Large aggregations of jellyfish can interfere with fishing operations by presenting
a health hazard to fishermen when pulling the fishing gear on board, splitting the
fishing nets due to the weight of the jellyfish in the nets or ruining the catch [59]. In
aquaculture, large aggregations of jellyfish have reportedly killed fish in pens [60, 
61]. Water desalination and power plants have also suffered the consequences of 
the presence of high numbers of jellyfish, which can clog seawater intake screens 
causing power reductions and shutdowns [62, 63]. 

There is, therefore, a need to develop new technologies that enable the automatic 
detection of these organisms to facilitate the design of adaptive management strate- 
gies in order to mitigate jellyfish associated impacts. Furthermore, the development 
of such technology will greatly facilitate the collection of long term monitoring data 
in a cost-effective way. 

So far, most studies aimed at monitoring and assessing the presence of jellyfish
have relied on manual methods, such as visual counts from boats [64] or small
aircraft [65], or on a combination of video recording with subsequent human-based
manual counting [13]. Manual methods, however, greatly limit the scope of the
studies both from a spatial and a temporal perspective.

Automatic computer-aided image classification represents a milestone for
observational ecological studies [17], as it can deal with the aforementioned
limitations. In Korea and Japan, where the presence of aggregations of big specimens
of jellyfish often interferes with human uses of the coast, the first attempts based
on the automatic detection of objects in images have been made to counteract the
presence of these organisms. In Korea, an automatic jellyfish elimination system was
created whereby an unmanned aerial vehicle equipped with cameras would identify
jellyfish at the surface of the water and would eliminate them using a blade system
[66]. In Japan, the first attempts to develop an automatic jellyfish detection system
are underway; however, so far the system has only been tested with artificially
generated jellyfish images [67].

Here, we introduce an application of Deep Learning techniques for the detection 
of jellyfish. Deep Learning techniques have emerged as a promising methodology to 
enable the automatic detection and quantification of jellyfish, simultaneously allow- 
ing for the development of early warning systems for the presence of jellyfish. As 
an example, the analysis of video data through the identification and quantification 
of jellyfish can be used in coastal areas to detect jellyfish abundances and decide the 
optimal point for beach closures in order to avoid undesired socioeconomic effects. 
From a scientific perspective, the use of automatic detection techniques will permit 
the creation and maintenance of much needed long-term data series in a cost-effective 
way. 

The case study is located on the island of Mallorca (Balearic Islands, Western
Mediterranean basin) (Fig. 3). Mallorca is one of the major tourism destinations in
Europe, where the presence of jellyfish can sometimes cause undesired effects on
tourism satisfaction. The example focuses on three of the most commonly encountered
jellyfish species in Balearic waters, namely Pelagia noctiluca, Cotylorhiza
tuberculata and Rhizostoma pulmo (Fig. 17). P. noctiluca can become very abundant
during spring and summer months and has a fairly painful sting that can be very
off-putting. C. tuberculata, although abundant towards the end of the summer, is
inoffensive to humans. Finally, R. pulmo is mildly stinging; however, due to its
relatively big size (up to 40 cm in radius) and solid appearance, swimmers are
generally able to spot it before getting stung.

The following sections explain the deep network architecture used and its char- 
acteristics, the different case studies, data processing, model tuning and validation 
process, and finally, the classification results. 


4.1 Deep Learning Approach 


In this application, an object detection architecture is used for the detection and
classification of the different jellyfish species. In the following section, the
network architecture and the training details are presented.

Fig. 13 Inception module, showing how the input is convolved with three different
kernel sizes: 1 × 1, 3 × 3, and 5 × 5. To limit the number of input channels, an extra
1 × 1 convolution is added before the 3 × 3 and 5 × 5 convolutions


Network Architecture 


The architecture used is the Inception-ResNet V2 [68], a very deep convolutional
neural network with over 450 layers that can efficiently learn to identify objects
in images, outputting instance bounding boxes and classifying them into one of the
specified classes with a confidence percentage.

When detecting objects on an image, one of the main problems is to select the 
kernel sizes for the convolutional layers, as the same object may appear with huge 
size and shape variations from one instance to another. A larger kernel is preferred 
for bigger, more global instances, and a smaller kernel is preferred for smaller ones. 
To tackle this issue, the architecture performs multiple parallel convolutions using 
different kernel sizes, making the network “wider” rather than “deeper”. The blocks 
of layers containing these convolutions are called inception modules [69], represented 
in Fig. 13. 

Another characteristic of the network is the use of Residual Connections [70],
which add the output of the convolution operation of the inception module to its
input. This introduces shortcuts in the model, which translates into a network that
is easier to optimize and more accurate. Figure 14 shows the structure of a Residual
Connection.
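The idea of a residual connection can be written in a few lines of NumPy. The sketch below is purely illustrative: `transform` stands in for the convolutions of an inception module and is an arbitrary function here, and the shape check reflects the requirement that input and output have the same dimensions for the sum to be possible.

```python
import numpy as np


def residual_block(x, transform):
    """Add the output of `transform` to its own input (a shortcut connection)."""
    out = transform(x)
    if out.shape != x.shape:
        raise ValueError("input and transformed output must have the same shape")
    return x + out


x = np.ones((2, 2))
# Stand-in for the convolutional branch: any shape-preserving function.
y = residual_block(x, lambda t: 0.5 * t)
# If the branch outputs zeros, the block reduces to the identity mapping.
identity = residual_block(x, np.zeros_like)
print(y, identity)
```

Because the input is passed through unchanged, the convolutional branch only has to learn a correction to it, which is what makes very deep networks easier to train.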

This architecture combines the inception modules with Residual Connections, 
obtaining the so called Inception-ResNet modules. Figure 15 shows an example of 
these modules. 

With these Inception-ResNet modules, the main body of the architecture is built.
Figure 16 shows a compressed view of the architecture.

Fig. 14 Residual connection structure





Training Details 


The Inception-ResNet V2 architecture is trained by means of readjusting the values 
of the kernels in the convolutional layers, backpropagating the loss computed over 
the predictions obtained on the softmax layers. 

Due to the high number of layers, the backpropagated loss signal becomes too small
to update the kernel values properly. To prevent the middle part of the network from
“dying out” during the backpropagation process, an auxiliary classifier is applied at
the output of the second block of Inception-ResNet modules. In this way, an auxiliary
loss is computed and added to the main one, as shown in Eq. 2.


$$\text{Total\_loss} = \text{main\_loss} + 0.3 \times \text{aux\_loss} \tag{2}$$


In order to train the network and adjust the kernel weights, a loss function is
needed to drive the backpropagation. In this case, a smooth L1 localization loss
function is used; its value increases as the predicted bounding box location diverges
from the one specified in the ground truth. Also, the Momentum optimiser algorithm
along with gradient clipping strategies [71] are implemented in order to help the
training process converge towards a minimum of the error.
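For reference, a common definition of the smooth L1 loss is sketched below in NumPy: it is quadratic for small coordinate errors and linear for large ones. The transition point `beta = 1.0` and the summation over the four box coordinates are assumptions of this example, and may differ from the exact formulation used by the detection framework.

```python
import numpy as np


def smooth_l1(pred_box, gt_box, beta=1.0):
    """Smooth L1 loss summed over the box coordinates.

    Quadratic (0.5 * d^2 / beta) when |d| < beta, linear (|d| - 0.5 * beta)
    otherwise. `beta` is an illustrative choice.
    """
    diff = np.abs(np.asarray(pred_box, dtype=float) - np.asarray(gt_box, dtype=float))
    per_coord = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_coord.sum())


exact = smooth_l1([0, 0, 1, 1], [0, 0, 1, 1])        # perfect localisation
small = smooth_l1([0.5, 0, 0, 0], [0, 0, 0, 0])      # small error, quadratic
large = smooth_l1([3.0, 0, 0, 0], [0, 0, 0, 0])      # large error, linear
print(exact, small, large)
```

The linear regime limits the influence of badly localized boxes on the gradient, which is the usual motivation for preferring this loss over a plain squared error.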

The architecture used for this application had already been trained over the COCO
dataset [72]. To retrain the network with the desired classes, a set of images
containing different jellyfish species and their corresponding ground truth are
needed. The ground truth in this case is a text file for each image, where the
bounding box and class for each jellyfish instance present in the image are indicated.

The trainings were performed on the same computer mentioned in Sect. 3.1.

Fig. 15 Inception-ResNet-A module. The max pooling branch from the Inception module is
substituted by the Residual Connection. The 5 × 5 convolution is split into two
equivalent 3 × 3 convolutions, boosting computational and accuracy performance (neural
networks perform better when convolutions do not alter the dimensions of the input
drastically). Finally, for the residual sum to work, the input and the output after
convolution must have the same dimensions; hence, a 1 × 1 convolution is applied after
the original convolutions to match the depth sizes

Fig. 16 Neural network architecture, mainly composed of Inception-ResNet-A/B/C
modules, along with other complementary modules. More in-depth information about this
architecture can be found in [68]

4.2 Experimental Framework


This section describes the experimental framework followed. First, the image
acquisition, organisation and labelling processes are described. Subsequently, the
different case studies and hyperparameters used are presented. Finally, we describe
the validation and evaluation details.


Datasets 
Acquisition 


Training and testing images were extracted from underwater video sequences of the 
three species under consideration. The objective was to construct a dataset contain- 
ing the three species under different conditions, such as water coloration, turbidity, 
illumination and different jellyfish positions and sizes, assuring robustness in the 
training process. 

A dataset of 842 images was generated; 80% of the dataset was used to train the
network (674 images), while the remaining 20% was used for testing purposes (168
images).

Figure 17 shows sample images from the dataset showcasing different conditions. 


Labelling 


For every image of the dataset, an annotation file was generated using the LabelImg
tool [73]. This tool generates an “.xml” file which contains the position and class
of each instance present in the image. Figure 18 shows an original image along with
its ground truth “.xml” text file.
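Annotation files of this kind can be read with the Python standard library. The sketch below parses an annotation string shaped like the one in Fig. 18 and returns the class name and bounding box of each instance; the function name is hypothetical and the embedded example is a shortened copy of that file.

```python
import xml.etree.ElementTree as ET


def parse_annotation(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    instances = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        instances.append((obj.findtext("name"), coords))
    return instances


# Shortened version of the annotation shown in Fig. 18.
example = """
<annotation>
  <filename>IMG_00012.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>tuberculata</name>
    <bndbox>
      <xmin>616</xmin><ymin>127</ymin><xmax>973</xmax><ymax>525</ymax>
    </bndbox>
  </object>
</annotation>
"""
instances = parse_annotation(example)
print(instances)
```

An image containing several jellyfish simply has several `object` elements, each of which yields one entry in the returned list.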


Case Studies 


Following the same procedure used in the previous application in Sect. 3.2, the
network was trained using different sets of hyperparameters. First, the network was
trained with and without implementing data augmentation; secondly, two different
learning rates were set; and finally, the network was trained using different values
for the number of iterations. The different combinations of hyperparameters are
shown in Table 3.


Experiments 


Following the methodology described in Sect. 2 and implemented in the previous
application in Sect. 3.2, twelve different experiments were conducted,
K = 1, 2, ..., 12, each one assessing the performance of a case study, using its
corresponding hyperparameters and applying a 5-fold cross-validation,
i = 1, 2, ..., 5, as shown in Fig. 19.

In order to evaluate the performance of each model, the Intersection over Union
(IoU) method along with the average precision (AP) metric [74] were used. These are
the most common evaluation methods for object detection, used in object detection
competitions such as PASCAL VOC [75], ImageNet [76] or COCO [72].

Fig. 17 Images from the dataset showing the three jellyfish species under different
environmental conditions. Top: P. noctiluca, centre: R. pulmo, bottom: C. tuberculata

<annotation>
  <folder>Tuberculata</folder>
  <filename>IMG_00012.jpg</filename>
  <path>D:\Jellyfish\Tuberculata\IMG_00012.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>tuberculata</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>616</xmin>
      <ymin>127</ymin>
      <xmax>973</xmax>
      <ymax>525</ymax>
    </bndbox>
  </object>
</annotation>

Fig. 18 a Original image. b Corresponding ground truth “.xml” file, specifying the
jellyfish location and class


Table 3 Case studies. When applying data augmentation, random rotations and
horizontal and vertical flips are applied. The decay learning rate consists in
applying a learning rate of 5e-04 until 50% of the training and then dropping it to
5e-05

[Table body not recoverable; columns: case, data augmentation, learning rate (5e-04
or decay), iterations (k)]

The IoU measure gives the similarity between the predicted and the ground truth
bounding box areas, and is defined as the area of the intersection between the
bounding boxes divided by the area of their union (see Eq. 3).
Figure 20 illustrates how the IoU is calculated for a prediction.


$$\text{IoU} = \frac{A_{intersection}}{A_{union}} \tag{3}$$

Fig. 20 Representation of how the IoU is calculated between a prediction and a ground
truth bounding box (see Eq. 3)
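The IoU between two axis-aligned bounding boxes can be computed as in the sketch below, where boxes are given as (xmin, ymin, xmax, ymax) tuples, matching the fields of the annotation files. The function name and the handling of empty unions are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))    # partial overlap: 1 / 7
same = iou((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
apart = iou((0, 0, 1, 1), (5, 5, 6, 6))      # disjoint boxes
print(overlap, same, apart)
```

The value ranges from 0 for boxes that do not overlap to 1 for a prediction that matches the ground truth exactly.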


Once the IoU is calculated for a prediction, in order to determine whether that
prediction is a TP or an FP, a threshold value over the IoU is established. Following
the criteria applied in the PASCAL VOC challenge, this threshold is set at
thr_IoU = 0.5. A prediction is classified as a TP if its IoU value with any ground
truth bounding box is greater than or equal to thr_IoU and the predicted class matches
that of the ground truth; otherwise, the detection is marked as an FP. The following
equation represents this criterion:


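$$\text{Detection} = \begin{cases} \text{TP}, & \text{if } \text{IoU} \geq thr_{IoU} \text{ and } C_{pred} = C_{gt} \\ \text{FP}, & \text{otherwise} \end{cases} \tag{4}$$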


Also, ground truth instances which do not have an IoU ≥ thr_IoU with any prediction
are marked as FN.
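The TP, FP and FN assignment can be sketched as follows. This is a simplified illustration of the criterion in Eq. 4, not the evaluation code used here: the function names are hypothetical, a ground truth instance is counted as FN when no prediction produced a TP against it (an assumption about how class mismatches are handled), and duplicate detections of the same instance are not penalised.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def classify_detections(predictions, ground_truths, thr_iou=0.5):
    """Label each prediction as "TP" or "FP" and count the FN instances.

    predictions, ground_truths: lists of (class_name, box) tuples.
    """
    matched = set()
    labels = []
    for pred_class, pred_box in predictions:
        hit = None
        for idx, (gt_class, gt_box) in enumerate(ground_truths):
            if pred_class == gt_class and iou(pred_box, gt_box) >= thr_iou:
                hit = idx
                break
        if hit is None:
            labels.append("FP")
        else:
            labels.append("TP")
            matched.add(hit)
    false_negatives = len(ground_truths) - len(matched)
    return labels, false_negatives


ground_truths = [("pulmo", (0, 0, 10, 10)), ("noctiluca", (20, 20, 30, 30))]
predictions = [("pulmo", (1, 1, 10, 10)), ("pulmo", (50, 50, 60, 60))]
labels, false_negatives = classify_detections(predictions, ground_truths)
print(labels, false_negatives)
```

In this toy example the first prediction overlaps the R. pulmo instance with an IoU of 0.81 and is a TP, the second one matches nothing and is an FP, and the undetected P. noctiluca instance is an FN.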

From the TP, FP and FN values, the precision and recall metrics are calculated
for all classes. Finally, from these metrics, the AP of each class and the mean AP
(mAP) between classes are obtained. The AP can be understood as the average of the
maximum precision at different recall values, or the area under the
max(Precision)-Recall curve. Figure 21 illustrates the calculation of the AP for a
series of detections. More information about this evaluation metric can be found in
[74].
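The area under the max(Precision)-Recall curve can be computed as in the sketch below, which follows the usual all-points interpolation: the precision at each recall level is replaced by the maximum precision obtained at any higher recall, and the resulting step curve is integrated. The function name and the example values are hypothetical, and the exact variant used for the reported results may differ.

```python
def average_precision(recalls, precisions):
    """Area under the max(Precision)-Recall curve.

    recalls, precisions: values obtained while sweeping the detections in
    order of decreasing confidence, with recalls in non-decreasing order.
    """
    rec = [0.0] + list(recalls) + [1.0]
    prec = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    # Sum the rectangles under the resulting step curve.
    ap = 0.0
    for i in range(1, len(rec)):
        ap += (rec[i] - rec[i - 1]) * prec[i]
    return ap


ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
ap_perfect = average_precision([1.0], [1.0])
print(ap_example, ap_perfect)
```

The mAP is then the mean of the AP values obtained for the three jellyfish classes.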

The workflow followed to determine the detection performance of each model is
shown in Fig. 22.

Fig. 21 Example calculation of the AP for a series of detections. The blue line
represents the Precision-Recall curve. The orange line represents the
max(Precision)-Recall curve. The AP value equals the area under the
max(Precision)-Recall curve (orange area)

Fig. 22 Model “i” of experiment “K” evaluation process. The detection output is
compared with its corresponding ground truth, obtaining the FP, FN and TP values. From
these, the Precision, Recall and mAP values are calculated


4.3 Results 


This section shows the results for each of the experiments and the assessment of the
hyperparameters.


Experiment Performance 


The mean results obtained when evaluating the five models of each experiment over
their corresponding test set are shown in Table 4, showing the AP obtained for each
class and the mAP value.

Table 4 Results obtained for all experiments from evaluating the test set, showing the
AP for each class and the mAP over all classes

Exp. | Data aug. | Learning rate | Iterations (k) | AP noctiluca (%) | AP pulmo (%) | AP tuberculata (%) | mAP (%)

[Table rows not recoverable]

All experiments show mAP values around 90%, reaching a maximum of 90.6%
for experiment 9 and a minimum value of 89.1% for experiment 1. Looking at the
AP values for the three species, it can be seen that both R. pulmo and C. tuberculata
have much higher AP values than P. noctiluca. This might be due to the fact that R.
pulmo and C. tuberculata are bigger specimens and the shape of their bodies remains
relatively unchanged while swimming, and therefore they might be easier to identify.
On the contrary, in P. noctiluca the relative position of the tentacles in relation
to the main body (umbrella) changes to a greater extent with the movement of the
animal, which therefore adopts a multitude of shapes, making it more difficult to
identify.

Experiments in which data augmentation is applied tend to perform slightly
better. The same holds for the number of iterations: experiments trained
for 20k or 40k iterations show a small increase in performance. Applying
the decay technique to the learning rate does not seem to have a
significant impact on performance. Qualitative results for the jellyfish detection
process on the test dataset are shown in Fig. 23.
Fig. 23 Visualization of the jellyfish detection obtained from images of the test set. The green
bounding boxes represent P. noctiluca detections, the blue boxes correspond to R. pulmo and the
orange ones to C. tuberculata
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5 Conclusions 


The two applications presented in this chapter clearly demonstrate the potential 
of deep learning in aiding the classification and processing of large image data sets 
collected in ecological studies. In the first study example, we showcase the application 
of a deep semantic segmentation neural network architecture to automatically detect
the habitat-forming seagrass species P. oceanica.

Diverse hyperparameter configurations were tested in order to find those that
provided the best metrics. The evaluation of the models showed that the
best metrics were achieved when data augmentation was applied and the network
was trained for 16,000 iterations with a learning rate of 1e-05. A video presenting
the network's semantic segmentation can be seen at [77].

The results of this study are encouraging and show that deep learning techniques 
can be a useful tool for the automatic classification of underwater habitats. Future 
research should extend this capability and build networks that can detect and
classify multiple habitat types of coastal areas. 

In the second example study, an object detection deep network has been used to 
automatically identify three commonly occurring species of jellyfish in the Mediter- 
ranean. 

Once again, diverse hyperparameter configurations were tested in order to find
those that provided the best metrics. In this case, the evaluation of the models
showed that the best metrics were achieved when data augmentation was applied
and the network was trained for 40,000 iterations with a decaying learning rate. A
video presenting the network's jellyfish detection can be seen at [78].

These results show the potential of object detection in the identification of marine 
species in image data, not only for jellyfish but for many other species that can be 
filmed in underwater environments. With respect to the detection of jellyfish, these 
results will be used to train the network to recognise more species. The demon- 
strated automatic detection methods will have direct applications for the monitoring 
of jellyfish in the proximity of beaches. 

To conclude, deep learning techniques have a huge potential in supporting ecolog- 
ical studies. Once neural networks are functioning to a high precision in the detection 
of habitats or species, they can be applied to other datasets originating from other 
locations, thus providing help in image data processing for a much wider, potentially 
global, audience of scientists. Equally, with the help of experts in providing classified 
images, existing neural networks can be extended to include more habitats or species 
in the future. However, to further this development, information technologists and 
natural scientists alike need to engage more actively with each other's fields and search
for collaborations. Deep learning techniques have an essential part to play in moving 
ecological studies to a new level, providing more cost-effective data collection solu- 
tions at a time when large amounts of data are needed to detect and adapt to global 
environmental change. 
Abstract Collisions between birds and wind turbines can be a significant problem in
wind farms. Practical deterrent methods are required to prevent these collisions.
However, it is improbable that a single deterrent method would work for all bird species
in a given area. An automatic bird identification system is needed in order to develop
species-level deterrent methods. This system is the first and necessary part of an
entirety that is eventually able to monitor bird movements, identify bird species, and
launch deterrent measures. The system consists of a radar system for detecting the
birds, a digital single-lens reflex camera with a telephoto lens for capturing images,
a motorized video head for steering the camera, and convolutional neural networks
trained on the images with a deep learning algorithm for image classification. We
utilized imbalanced data because the distribution of the captured images is naturally
imbalanced. We used the distribution of the training data set to estimate the actual
distribution of the bird species in the test area. Species identification is based on an
image classifier that is a hybrid of hierarchical and cascade models. The main idea is
to train classifiers on bird species groups in which the species resemble each other
more than they resemble any species outside the group in terms of morphology
(coloration and shape). The results of this study show that the developed image
classifier model has sufficient performance to identify bird species in a test area.
The proposed system produced very good results when the hybrid hierarchical model
was applied to the imbalanced data sets.
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1 Introduction 


Demand for automatic bird identification systems for wind farms has increased 
recently. This kind of system is especially required for offshore wind farms. The 
objective of this application is twofold: first, it has to detect two key bird species,
which the environmental license specifically requires to be monitored, and secondly,
it has to classify the maximum number of other bird species while the first requirement
still holds. The two key species are the white-tailed eagle (Haliaeetus albicilla) and
the lesser black-backed gull (Larus fuscus fuscus). An automatic identification
system is in development that consists of a separate commercial radar system to 
detect the birds, a digital single-lens reflex camera with telephoto lens for capturing 
images, a motorized video head for steering the camera, and a convolutional neu- 
ral network (CNN) trained on the images with a deep learning algorithm for image 
classification. The conventional approach to this image classification problem is to 
presume that equally distributed data are fed into the classifier. However, this is a 
real-world application, in which it is difficult and time-consuming to collect a large
number of images for each class. Due to the nature of this application, it is reasonable
to utilize imbalanced data because the distribution of the captured images
is naturally imbalanced, i.e., there are common and scarce bird species in the test
area. It is also possible to include scarcer classes into the classification process with 
this approach. Researchers have proposed a class-imbalance aware loss function for 
the problem of class imbalance. This loss function adds an extra class-imbalance 
aware regularization term to the normal softmax loss [1]. However, we have applied 
the distribution of the training data set to estimate the actual distribution of the bird 
species in the test area. Training data set and test data set both share this distribution. 
Species identification is based on an image classifier that is a hybrid of hierarchical
and cascaded models. The main idea is to train classifiers on bird species groups in
which the species resemble each other more than they resemble any species outside the
group in terms of morphology. The first classifier is hierarchical, determining the group
of the test image, and the subsequent classifiers within the groups are arranged in
cascades. We have also applied our own data augmentation method, which rotates the
images and converts them to the desired color temperatures. The hybrid hierarchical and
cascade model is compared to two single classifiers. One of these is trained on a
balanced data set and the other on an imbalanced data set without grouping.

CNNs have been successfully applied to image classification problems [2]. The
number of training examples in image classification is typically large. This may
cause problems when dealing with real-world applications, as collecting a large
number of images is not always possible. As a result, some data augmentation is
usually needed [3, 4]. Cascade CNNs have been successfully applied to face detection
and road-sign classification systems [5, 6].

The remainder of this chapter is organized as follows. In Sect. 2 we present the
system and its components for collecting images automatically. In Sect. 3 related work
is discussed. We describe our data, its grouping, and the data augmentation algorithm
in Sect. 4. Classification algorithms, applied CNN models, and feature extraction
are described in Sect. 5. Results for the hybrid of hierarchical and cascade CNN models,
trained on an imbalanced data set and compared to a conventional CNN model, are
presented in Sect. 6. We then offer conclusions in Sect. 7.


2 The System 


The proposed system consists of several hardware as well as software modules. See
Fig. 1 for an illustration. First, there is the radar system, which is connected to a
local area network (LAN) and is thus able to communicate with the servers on
which the various programs are running.


Fig. 1 The hardware of the system and the principle of catching a flying bird in the frame area of
the camera. Labels in the figure: angle of view 70°; target speed e.g. 12 m/s; the target is faster
than the motorized video head movement, hence it will fly through the camera beam; slow motion
when taking an image; fast motion when moving over to the correct position; camera (USB),
steering head, camera control server, radar server, head steering server
We use a radar system supplied by Robin Radar Systems B.V. because they provide
an avian radar system that is able to detect birds. They also have algorithms for
tracking a detected object over time (between the blips). The model we use is the
ROBIN 3D FLEX v1.6.3, which is actually a combination of two radars and a software
package implementing various algorithms such as the tracker algorithms [7].
The role of the radar is to detect flying birds and pass the WGS84 coordinates
of the target bird to the video head control software. The system includes the
PT-1020 Medium Duty motorized video head from 2B Security Systems. The video
head is operated via the Pelco-D control protocol [8], and we developed its control
software. The system uses a Canon EOS 7D II camera with a 20.2-megapixel
sensor and the Canon EF 500/f4 IS lens. Correct focusing of the images relies on the
autofocus system of the lens and the camera. Automatic exposure is also applied. The
camera is controlled through the application programming interface (API) of the camera
manufacturer, and we developed the software for controlling the camera.
In addition, the radar system provides parameters that can be used to increase
the performance of the classifiers. These parameters are the distance of a target in 3D
(m), the velocity of a target (m/s), and the trajectory of a target (WGS84 coordinates).
For the details of the system hardware, see [9].
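
To make the role of the Pelco-D protocol more concrete, the sketch below builds the 7-byte command frame that the protocol uses (sync byte, camera address, two command bytes, two data bytes, and a modulo-256 checksum). It is only an illustration of the frame format under the usual Pelco-D conventions, not the control software described above; the function and constant names are hypothetical, and sending the frame over a serial link is left out.

```python
# Direction bits of the second command byte in Pelco-D.
PAN_RIGHT, PAN_LEFT, TILT_UP, TILT_DOWN = 0x02, 0x04, 0x08, 0x10

def pelco_d_frame(address, command1, command2, data1, data2):
    """Build a Pelco-D frame: sync, address, cmd1, cmd2, data1, data2, checksum."""
    payload = [address, command1, command2, data1, data2]
    if any(not 0 <= b <= 0xFF for b in payload):
        raise ValueError("every field must fit in one byte")
    checksum = sum(payload) % 256  # sum of bytes 2..6, modulo 256
    return bytes([0xFF] + payload + [checksum])

def pan_tilt(address, direction_bits, pan_speed=0x20, tilt_speed=0x20):
    """Frame that moves the head; data1 is the pan speed, data2 the tilt speed."""
    return pelco_d_frame(address, 0x00, direction_bits, pan_speed, tilt_speed)

def stop(address):
    """Frame with all command and data bytes cleared, which stops the motion."""
    return pelco_d_frame(address, 0x00, 0x00, 0x00, 0x00)

# Pan right at medium speed on the head with address 1.
frame = pan_tilt(1, PAN_RIGHT, pan_speed=0x20, tilt_speed=0x00)
```

A tracker would typically alternate such frames with stop frames while converting the radar coordinates into pan and tilt targets; that conversion depends on the installation and is not shown here.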


3 Related Work 


Researchers have proposed a multi-sensor data fusion approach using acoustics, an
infrared camera, and marine radar for avian monitoring. The objective is to preserve the
population of birds and bats, especially those listed as endangered, by observing
their activity and behavior over the migration period. Species-level identification was
not the main aim. They address this problem with a fuzzy Bayesian multi-sensor
data fusion approach that provides activity information on the targets
in avian (bird and bat) monitoring [10].

Researchers have implemented machine learning (ML) algorithms on radar data 
for bird species classification. They used data collected from two locations in Portu- 
gal with two marine radar antennas (volume search radar, VSR and high sensitivity 
reception, HSR). The performance of six widely used ML algorithms: random forests 
(RF), support vector machine (SVM), artificial neural networks, linear discriminant 
analysis, quadratic discriminant analysis, and decision trees (DT), was tested. They 
found that all algorithms performed well (area under the receiver operating characteristic
curve and accuracy >0.80, P < 0.001) when discriminating birds from non-biological
targets such as vehicles, rain or wind turbines, but the algorithms showed greater 
variance in their performance when they classified different bird functional groups 
or bird species (e.g. herons vs. gulls). In this study, only RF was able to hold an 
accuracy >0.80 for all classification tasks, although SVM and DT also performed 
well. All algorithms correctly classified 86% and 66% (VSR and HSR, respectively) 
of the target points, and only 2% and 4% of these points were misclassified by all 
algorithms. The results suggest that ML algorithms are suitable for classifying radar
targets as birds, and thereby separating them from non-biological targets. The
ability of these algorithms to distinguish correctly between bird species or functional
groups was much weaker [11].

Time-lapse photography is a method in which the frame rate at which a sequence
of images is captured is lower than the frame rate used to view the sequence. Time-lapse
images can make subtle time-related processes distinct when the process being analyzed
is too slow for the human eye to follow. Time-lapse images have been used to detect
birds around a wind farm. Image-based detection using cameras has been applied
to build a bird monitoring system. This system utilized an open-access time-lapse
image data set collected around a wind farm. The system applied the following
algorithms and features: AdaBoost, Haar-like features, histograms of oriented gradients
(HOG), and CNN. AdaBoost is a two-class classifier based on feature selection and
weighted majority voting. A strong classifier is built as a weighted sum of many weak
classifiers, and the resulting classifier is shallow but robust [12]. Haar-like is an image
feature that utilizes contrasts in images. It extracts the light and the shade of objects
by using black-and-white patterns [13]. HOG is a feature used for capturing the
approximate shape of objects. First, the spatial gradient of the image is computed and
a histogram of the quantized gradient direction is made in each local region of the
image, called a cell. Subsequently, the histograms of the cells in neighboring groups
of cells (the blocks) are concatenated and normalized by dividing by their
Euclidean norm in each block [14]. The best method for detection was Haar-like,
and the best method for classification was CNN. The system was tested on only two
bird functional groups, hawks and crows, and it achieved only moderate performance
[15].
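
The HOG steps described above (gradient, per-cell histogram of quantized directions, block-wise concatenation and Euclidean normalization) can be sketched with NumPy as follows. This is a simplified illustration rather than the implementation used in [14] or [15]: the cell size, block size, number of bins, the use of unsigned directions, and the weighting of histogram votes by gradient magnitude are all assumptions made for the example.

```python
import numpy as np

def hog(image, cell=8, block=2, bins=9):
    """Simplified HOG descriptor of a 2-D grayscale array.

    The image must contain at least `block` cells along each axis.
    """
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)                      # spatial gradient (rows, columns)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned direction in [0, 180)
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)

    ny, nx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ny, nx, bins))
    for i in range(ny):
        for j in range(nx):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            # Magnitude-weighted histogram of the quantized directions in this cell.
            hist[i, j] = np.bincount(bin_idx[rows, cols].ravel(),
                                     weights=mag[rows, cols].ravel(),
                                     minlength=bins)

    features = []
    for i in range(ny - block + 1):
        for j in range(nx - block + 1):
            v = hist[i:i + block, j:j + block].ravel()        # concatenate cell histograms
            features.append(v / (np.linalg.norm(v) + 1e-6))   # Euclidean normalization
    return np.concatenate(features)

# A horizontal intensity ramp has a purely horizontal gradient.
ramp = np.tile(np.arange(16.0), (16, 1))
descriptor = hog(ramp)
```

For the 16 x 16 ramp there are four cells and a single block, so the descriptor has 4 * 9 = 36 entries and all of its energy falls into the first direction bin of each cell.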


4 Data 


Input data of this application consist of digital images. All images for training the 
CNN have been taken manually at the test location in various weather conditions. The 
location is the same where the camera will be installed for taking images automati- 
cally. The collected image set was divided into two data sets: an original data set for 
training classifiers, and a test set for measuring generalization of the classifiers, and 
thus the classifiers will not see these test images during training. Both data sets are 
divided into 14 classes. It became clear during image collection that there would be
a low number of images of the scarcest bird species, resulting in classes with a very low
number of data examples. Therefore, in order to be able to classify the scarcest bird
species, all the collected images are included, accepting that the resulting
data set will be imbalanced. The distribution of the number of images for each class
is used as an estimate of the actual distribution of bird species in the test area. This is
justified by the fact that images were collected in all four seasons and at all hours of
daylight. The estimate is not necessarily reliable in terms of a bird species census,
because only the species that usually fly at approximately the same height as the wind
turbines are taken into account, but it is sufficient in the context of this application.
The total number of images in the original data set is 24631, and the number of images
in the test data set is 439. The test data set was created by randomly choosing images
from all classes. The number of images per class in the test data set follows the
distribution of the original data set, thus reflecting the actual distribution at the test
site. Class labels and the number of images of the original data set for each class are
presented in Table 1. In this table, three classes are not defined at species level: LNSP,
SWSP, and CATE. In the first two cases, there is no need to distinguish between
loon species or swan species any further in this context, regardless of the fact that two
common and two rare species of loons occur in the test area, and analogously
two common and one rare species of swans. The same applies to the third
case, the common/arctic tern. In addition, it is generally very difficult to tell the
difference between these two tern species [16], and thus the number of required data
examples (images) might be too large, considering the time needed to collect them.
The number of images for each class in the test data set is also given in Table 1.
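
One way to draw a random test set whose class shares follow those of the full image collection is stratified sampling, sketched below. This is an illustration of the idea only; the function name, the rounding rule, and the guarantee of at least one test image per class are assumptions of the example and not details taken from the chapter.

```python
import random
from collections import defaultdict

def stratified_test_split(labels, test_size, seed=0):
    """Return sorted indices of a test set whose class shares follow `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    test_indices = []
    for label, indices in by_class.items():
        # Proportional share of the class, with at least one image (assumption).
        k = max(1, round(test_size * len(indices) / len(labels)))
        test_indices.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(test_indices)

# Toy collection with a common and a scarce class.
labels = ["common"] * 900 + ["scarce"] * 100
test_idx = stratified_test_split(labels, test_size=50)
```

Because of the per-class rounding, the size of the returned set can differ slightly from `test_size` when there are many small classes. The selected images are then removed from the training pool so that the classifiers never see them during training.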
No preprocessing, other than cropping, is applied to the images before feeding
them into the classifiers. The cropping is based on a segmentation, and it is motivated
by the ability to discard most of the pixels representing only sky. The resolution
of the camera sensor, measured by the total number of pixels, and the focal length of
the lens are important qualities because of the long range from which the images are
taken. The effective number of pixels (ENP) is defined as the number of pixels
representing a bird. The remaining pixels are considered noise; thus ENP
has a significant effect on the performance of the image classification model, as birds
will be very small (they consist of only a small number of pixels) in the images.


Table 1 The original data set divided into 14 classes 
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Fig. 2 Data example of the common goldeneye, the black-headed gull and the lesser black-backed 
gull, respectively 


ENP depends on the sensor resolution of the camera and the focal length of the lens,
and if the sensor resolution is fixed, ENP can be increased by choosing a long (in
terms of focal length) telephoto lens. An additional advantage is that there is no need
to feed the classifiers with large (in terms of the number of pixels) images. For more
details about the segmentation, see [9]. Examples of the original data set images are
presented in Fig. 2. The first image in this figure illustrates that there can be more
than one bird in the image. There are species in the test area that habitually
fly in tight flocks, and in these cases, the result (in terms of data examples) is an
image of several birds. Moreover, there might be more than one bird species in
these flocks. The habit of flying in tight flocks is an important identification feature
for certain bird species [17]. As a result of the segmentation, an image
has only one bird left when there is a sparse flock of birds in the image. In the sparse
flock case, the bird closest to the center of the image is retained; thus, when a sparse
flock has more than one bird species, the retained bird species is in effect chosen
randomly. In the tight flock case, the identification is based on the whole flock, and
thus it is biased toward the most numerous bird species in the flock.


4.1 Data Augmentation 


Data augmentation is applied to the original data set. We have used our own method,
in which the images are converted into various color temperatures according to a step
size, s. The lower and upper limits of the color temperature are 2000 K and 15000 K,
respectively. For example, if s = 50 (in K), there are (15000 - 2000)/50 + 1 = 261 color
temperatures, and the number of data examples in the augmented data set is
262 * 24631 = 6453322, i.e., one more image per original than the number of color
temperatures (presumably the original image itself). For more details,
see [9]. In addition to the color conversion, the images are also rotated by a random
angle between -20° and 20° drawn from the uniform distribution. The motivation for this
is that CNNs are invariant to small translations, but not to image rotation [18]. The
numbers of images for each species (class) in the augmented data set when s = 50 and
s = 200, respectively, are presented in Table 2. Figure 3 presents data examples of the
output of the augmentation algorithm. The original image, fed into the data augmentation
algorithm, has a color temperature of 5800 K, and the two augmented images have
color temperatures of 3800 K and 7800 K, respectively.

Fig. 3 Data example of the common eider. The image on the left is an augmented image with a color
temperature of 3800 K. The original image is in the middle with a color temperature of 5800 K. The
image on the right is an augmented image with a color temperature of 7800 K
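
The size of the augmented data set and the random rotation can be checked with a short Python sketch. The function names are hypothetical, and the color conversion itself (described in [9]) is not reproduced. Note one assumption: the total of 6453322 quoted in the text equals 262 * 24631, i.e., the 261 color temperatures for s = 50 plus one further image per original, which the sketch treats as the original image itself.

```python
import random

def color_temperatures(step, low=2000, high=15000):
    """Color temperatures (in K) from `low` to `high` inclusive, `step` apart."""
    return list(range(low, high + 1, step))

def augmented_size(n_original, step, low=2000, high=15000):
    """Images after augmentation: one per color temperature plus one extra per
    original image (assumed to be the original; this matches the quoted total)."""
    return (len(color_temperatures(step, low, high)) + 1) * n_original

def random_rotation_angle(rng=random):
    """Rotation angle in degrees drawn uniformly from [-20, 20]."""
    return rng.uniform(-20.0, 20.0)

n_temperatures = len(color_temperatures(50))   # 261
total = augmented_size(24631, 50)              # 6453322
angle = random_rotation_angle(random.Random(0))
```

With s = 200 the same rule gives 66 color temperatures and a factor of 67 per original image.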


4.2 Grouping Data 


We trained our first classifier models on the entire data set, which was divided into
as many classes as the data had species. However, some classes are more easily
separable than others (as assessed by the human eye), and this led to the idea of
grouping together those species that seem similar to the human eye; thus our proposal
in this respect is hierarchical [19]. In this approach, the number of classes decreases
at the top level of the classifier hierarchy, resulting in better separability of
the data set. Classification inside the groups is handled by the subsequent levels in
cascades [20]. Figure 4 illustrates examples of clearly and weakly separable classes.
The white-tailed eagle and the mute swan are examples of clearly separable classes,
and the herring gull and the common gull are examples of weakly separable classes.

There are four groups at the top level of the classification hierarchy. Two of these
groups are actually single, clearly separable species: swans (treated as a single species
here) and the white-tailed eagle. Gulls-and-terns and waterfowl (including
loons and the common cormorant) form the other two groups. More groups
are defined below the top level in order to obtain species-level classification. The
division of the classes into groups is given in Table 3. The numbers of images for each
group formed from the original data set and from the two augmented data sets with s =
50 and s = 200, respectively, are given in Table 4.
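
The way a test image travels through such a hierarchy can be sketched as a simple routing loop: the top-level classifier picks a group, and the classifier of that group picks a subgroup or a species, until a label is reached that has no classifier of its own. The sketch is hypothetical throughout: the dictionary layout, the node names, and the stub classifiers stand in for trained CNNs, and the nesting shown is only an example, not the division given in Table 3.

```python
def classify(image, classifiers, root="top"):
    """Follow group predictions down the hierarchy until a species label is reached.

    `classifiers` maps a group name to a callable that returns either the name
    of a subgroup (another key of `classifiers`) or a final class label.
    """
    node = root
    for _ in range(len(classifiers) + 1):
        if node not in classifiers:
            return node                    # no classifier below: final label
        node = classifiers[node](image)    # descend one level
    raise ValueError("cycle in the classifier hierarchy")

# Stub classifiers with fixed answers, standing in for trained CNNs.
stub_classifiers = {
    "top": lambda image: "gulls-and-terns",
    "gulls-and-terns": lambda image: "black-backed gulls",
    "black-backed gulls": lambda image: "LBBG",
}
label = classify(image=None, classifiers=stub_classifiers)
```

Groups that consist of a single clearly separable species, such as the white-tailed eagle, need no classifier of their own in this scheme, since the group name is already the final label.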





Fig. 4 From left to right: the white-tailed eagle and the mute swan are examples of clearly separable
classes. The herring gull and the common gull are examples of weakly separable classes


Table 3 Division of the classes into groups 
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White-tailed eagle 164150 
Swans 17420 
Waterfowl-1 579550 
Waterfowl-2 203546 
Gulls-and-terns-1 889157 
Gulls-and-terns-2 373190 
Gray-backed gulls 515967 
Blackheaded-tern 239659 
Black-backed gulls 118858 


5 Classification 


All classifiers in this application share the same CNN model, which is shown in
Fig. 5. Only the number of neurons in the output layer changes according to the
number of classes. This model has three convolution layers, each of which is
followed by a rectified linear unit (ReLU) layer, and the first two are also followed by a
cross-channel normalization layer (Local Response Normalization, LRN). The use
of LRN is motivated by its ability to aid generalization, as its function may be
seen as brightness normalization [2]. There are two max-pooling layers: the first is
before the third convolution layer and the second is before the first fully connected
layer. There is no max-pooling layer before the second convolution layer. The reason
for this is the small ENP: by omitting a max-pooling layer, all of the finest
edges detected by the first convolution layer are transferred to the second convolution
layer. The architecture is completed by three fully connected layers. The first two of
them are followed by dropout layers, and each dropout layer is followed by a ReLU.
The dropout was implemented by randomly setting the output neurons of the layer
to zero with a probability of 0.5. The architecture is finally terminated by softmax
activation, which produces a distribution over the class labels, with a cross-entropy
loss function [21]. The input image is normalized and zero-centered before it is fed
to the network. The CNN is trained in supervised mode with mini-batches, using
stochastic gradient descent with momentum [21-24]. The L2 regularization
(weight decay) method for reducing over-fitting is also applied [21, 24, 25]. We
kept the network size, in terms of free parameters, small due to the limited capacity of
our computer resources. This results in a total of 92 feature maps, which are extracted
by convolution layers with kernel sizes [12 x 12 x 3] x 12, [3 x 3 x 12] x 16 and
[3 x 3 x 16] x 64, respectively. The total number of weights is about 9.47 x 10°.

Fig. 5 The CNN model for each classifier

Table 5 Parameters for the convolution layers and the max-pooling layers of the CNN model
(table body not recoverable from the source; according to the text, its columns are filter size,
number of feature maps, feature map size in neurons, stride, and padding)

Images with a size of 200 x 200 pixels are fed to each classifier. In the first
convolution layer, this image size produces square feature maps with a side of
(200 - 12 + 2 * 1)/2 + 1 = 96 neurons, i.e., there are 96 x 96 = 9216 neurons in each
feature map. The filter size, number of feature maps, feature map size in neurons,
stride, and padding for each convolution layer and max-pooling layer are given in
Table 5. For each filter, Fig. 5 displays the number of feature maps as the triplet
[a, b, c].
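
The feature map arithmetic above follows the usual output-size rule for convolution layers, (input - filter + 2 * padding) / stride + 1. The short sketch below reproduces the numbers quoted in the text; the function names are illustrative, and the weight count covers only the three convolution layers (without biases), since the sizes of the fully connected layers are not stated here.

```python
def conv_output_size(input_size, filter_size, stride, padding):
    """Side length of the feature map produced by a square convolution filter."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_weights(kernel_height, kernel_width, in_channels, n_filters):
    """Number of weights (without biases) in one convolution layer."""
    return kernel_height * kernel_width * in_channels * n_filters

side = conv_output_size(200, 12, stride=2, padding=1)    # 96
neurons_per_map = side * side                            # 9216

# Kernel sizes quoted in the text: [12 x 12 x 3] x 12, [3 x 3 x 12] x 16, [3 x 3 x 16] x 64.
layers = [(12, 12, 3, 12), (3, 3, 12, 16), (3, 3, 16, 64)]
total_feature_maps = sum(layer[3] for layer in layers)           # 92
total_conv_weights = sum(conv_weights(*layer) for layer in layers)
```

The convolution layers account for only a small part of the weights; most of the parameters of such a model sit in the fully connected layers.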


5.1 Hyperparameter Selection 


We split the data set into a training set and a validation set of 70% and 30%,
respectively. We used manual tuning for choosing the number of epochs. Initial weights
for all layers are drawn from the Gaussian distribution with mean 0 and standard
deviation 0.01. Initial biases are set to zero. The L2 value is set to 0.0005 and the
mini-batch size is set to 128.
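
The initialization and the regularized update can be written down compactly with NumPy. The sketch below is illustrative only: the standard deviation of 0.01, the zero biases, and the L2 value of 0.0005 come from the text, whereas the learning rate, the momentum value of 0.9, and the function names are assumptions of the example.

```python
import numpy as np

def init_layer(weight_shape, n_outputs, rng):
    """Gaussian weights (mean 0, standard deviation 0.01) and zero biases."""
    weights = rng.normal(0.0, 0.01, size=weight_shape)
    biases = np.zeros(n_outputs)
    return weights, biases

def sgd_momentum_step(weights, grad, velocity, lr, momentum=0.9, l2=0.0005):
    """One SGD-with-momentum update with L2 regularization (weight decay).

    `lr` and `momentum` are assumed values; `l2` is the value given in the text.
    """
    velocity = momentum * velocity - lr * (grad + l2 * weights)
    return weights + velocity, velocity

rng = np.random.default_rng(0)
w, b = init_layer((12, 12, 3, 12), 12, rng)   # first convolution layer

# With a zero gradient, only the weight decay term shrinks the weights.
w_new, v_new = sgd_momentum_step(np.ones(3), np.zeros(3), np.zeros(3), lr=0.1)
```

In mini-batch training the gradient passed to the update is the average over the 128 images of the current batch.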


5.2 Feature Extraction 


The three convolution layers are designed to detect spatially distributed features from
the training images. The usual distinguishing features are the shape and general
coloration of the bird. The ReLU layer (which introduces non-linearity) and the
max-pooling layer (which increases spatial invariance) after the second and third
convolution layers, respectively, may be seen as a refinement of the detected features
due to the rectifying and down-sampling properties of these layers. Figures 6 and 7
illustrate the features extracted by the CNN for the classes LBBG and GBBG,
respectively. These feature maps are from the second convolution layer. There is one
frame for each of the 16 feature maps in the figure. These images are normalized so
that the minimum weight is 0 and the maximum is 1, i.e., the most negative weight has
turned into zero (black). The mid-gray color (0.5) shows those areas in the image that
contribute least to the features, and the most blackish or the most whitish areas denote
the maximum contribution to the features. The plain gray, or almost plain gray, feature
maps indicate that no significant features have been found in these maps. These feature
maps show that the CNN is capable of giving large weights to those areas of the bird
plumage that are relevant for species identification. With these two pairs of gull
species, these areas are mainly the wing tips, feet, and bill. As flying gulls usually
have their feet concealed by feathers, and their underside is not always visible in the
images, this feature is of minor use. This leaves us the bill and the wing tip, and
because the differences in bill color and structure are only subtle, the most significant
identification point is the wing tip. The great black-backed gull and the lesser
black-backed gull also have a slight difference in the hue of their upper wing color,
but this does not always seem to result in larger weights produced by the CNN for those
areas, at least not large enough, because images of the great black-backed gull are even
misclassified as the herring gull. Yet, the upper wing color is the key feature for
distinguishing between the gray-backed gulls and black-backed gulls [26].

Fig. 6 Visualization of the feature extraction by the CNN model for the class LBBG. There are 16
feature maps in the figure extracted by the second convolution layer

Fig. 7 Visualization of the feature extraction by the CNN model for the class GBBG. There are 16
feature maps in the figure extracted by the second convolution layer
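
The normalization used for displaying the feature maps is a min-max scaling to the range from 0 (black) to 1 (white). A minimal NumPy sketch is given below; the function name is hypothetical, and the handling of a constant map (shown as plain mid-gray) is an assumption of the example.

```python
import numpy as np

def normalize_for_display(feature_map):
    """Scale a feature map linearly so that its minimum is 0 and its maximum is 1."""
    fm = np.asarray(feature_map, dtype=float)
    span = fm.max() - fm.min()
    if span == 0:
        return np.full_like(fm, 0.5)   # constant map: plain mid-gray (assumption)
    return (fm - fm.min()) / span

shown = normalize_for_display([[-2.0, 0.0], [1.0, 2.0]])
```

In this toy map the most negative value becomes 0, the largest becomes 1, and the value 0 lands on mid-gray (0.5) because the range is symmetric around zero.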
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Table 6 Filter sizes for the second modified model of the original CNN model 


Convolution 1   12 x 12          [1 1]
Convolution 2   7 x 7            [1 1]   [0 0]
Convolution 3   5 x 5            [1 1]   [0 0]
Convolution 4   4 x 4                    [0 0]
Convolution 5   3 x 3     128    [1 1]   [0 0]


5.3 Tests for Deeper CNN Model 


It became clear during the development of this algorithm that the challenge, in terms
of classification, lies in the group of gulls-and-terns-1, especially in the groups of
gray-backed gulls and black-backed gulls. Considering the CNN model, the first option
for better performance should be a deeper model, i.e., more convolution layers. We
modified the original model by adding a fourth convolution layer, followed by
ReLU and max-pooling layers. This model had 128 filters with a filter size of [5 x
5] in its fourth convolution layer. The first modified model was tested on the group
of black-backed gulls, but it failed to improve on the performance of the original CNN
model. Then we tested an even deeper model by adding a fifth convolution layer,
again followed by ReLU and max-pooling layers. In this case, the max-pooling layer
before the third convolution layer in the original model was removed in order to have
a sufficient number of neurons left at the output of the architecture. We also modified
the filter sizes of the second modified model. The modified filter sizes are given in
Table 6. The two new max-pooling layers at the end of the second modified model
each have a filter size of [2 2]. When this model was tested on the group of
black-backed gulls, the result was the same as for the first modified model, i.e., it did
not achieve better performance than the original CNN model in terms of true positive
rate (TPR). Both test classifiers were trained on the augmented set, with s = 50, of
only the images from the group of black-backed gulls.


5.4 Dealing with Imbalanced Data 


If we want to identify (classify) all the species that occur in the test area, we must 
accept that the training data set will be imbalanced, because there will be few 
training examples of the scarcest species. However, there are methods for handling 
an imbalanced data set. Naturally, the first option would be to collect more data for 
the training data set, but this is not a very realistic option in our case. 
Resampling is a method that is easy to implement and fast to run. It means that 
copies of data examples are added to the under-represented class, i.e., over-sampling, 
or data examples are deleted from the over-represented class, i.e., under-sampling 
[11]. However, we have augmented the original data set (resampling is not used) with 
s = 50, and trained a reference classifier on the augmented data set. The results, in 
terms of performance, of the hybrid model (hierarchical and cascade model combined) 
trained on the grouped data set are compared with this reference classifier. The 
grouped data set is also augmented with s = 50, and both data sets are imbalanced. 
Class imbalance ratios (i.e., the ratio of the number of images in a class to the 
number of images in the largest class) of the original data set for 13 classes, 
rounded to the nearest integer, are given in Table 7. The class with the largest 
number of images is GRCO, and it is omitted from the table. It can be seen from the 
table that there is severe imbalance between several classes and the class GRCO. 

Table 7 Imbalance ratios of 13 classes to the class with the largest number of images (GRCO) 

Class label    Ratio 
WTEA           1:2 
SWSP           (unreadable) 
LOSP           1:14 
COEI           1:6 
COGO           1:5 
VESC           1:24 
RBME           1:21 
GBBG           1:11 
HEGU           1:1 
LBBG           1:4 
COGU           1:2 
BHGU           1:3 
CATE           1:3 
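
The way such ratios are obtained can be illustrated with a minimal Python sketch. The image counts below are made-up placeholders, not the counts of the data set used in this chapter, and the function name `imbalance_ratios` is hypothetical.

```python
def imbalance_ratios(counts):
    """Return {label: k}, meaning a ratio of 1:k to the largest class."""
    largest = max(counts.values())
    return {
        label: round(largest / n)      # rounded to the nearest integer
        for label, n in counts.items()
        if n != largest                # the largest class itself is omitted
    }

# Placeholder counts for illustration only (not the real data set).
example_counts = {"GRCO": 1200, "HEGU": 1150, "LOSP": 86, "VESC": 50}
ratios = imbalance_ratios(example_counts)
for label, k in ratios.items():
    print(f"{label} 1:{k}")
```

A ratio of 1:1 in such a table therefore does not mean that two classes are equally large, only that their quotient rounds to one.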

Another reference classifier is trained on a balanced data set. This data set is 
created by under-sampling: the original data set is first augmented with 
s = 50, and then 236 x 262 = 61832 images are randomly chosen from each class, 
except for the class VESC, from which all of the images are chosen, because this 
class has the lowest number of data examples. 
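
Random under-sampling to the size of the smallest class can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the file names are placeholders, the function name `undersample` is hypothetical, and a fixed seed is used only to make the example repeatable.

```python
import random

def undersample(images_by_class, seed=0):
    """Randomly keep as many examples per class as the smallest class has."""
    rng = random.Random(seed)
    target = min(len(items) for items in images_by_class.values())
    return {
        label: rng.sample(items, target)   # sampling without replacement
        for label, items in images_by_class.items()
    }

# Placeholder file names for illustration only.
data = {
    "GRCO": [f"grco_{i}.png" for i in range(20)],
    "HEGU": [f"hegu_{i}.png" for i in range(12)],
    "VESC": [f"vesc_{i}.png" for i in range(5)],
}
balanced = undersample(data)
print({label: len(items) for label, items in balanced.items()})
```

The smallest class keeps all of its examples, while every other class is reduced to the same size.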

It is important to choose a suitable performance metric for classifiers trained 
on an imbalanced data set. We have used the confusion matrix as a tool to compare the 
classifiers. Precision (a measure of classifier exactness) and recall (a measure of 
classifier completeness, a.k.a. TPR) are metrics that have been calculated from the 
confusion matrices. Receiver operating characteristic (ROC) curves and histograms of 
predictions are the tools that have been applied to determine thresholds for the 
various classifiers trained on the grouped data set. Histograms present the 
predictions of a classifier fed with a test data set that the classifier has never 
seen before. Thus, a histogram shows the distribution of the predictions of a 
classifier over a class by presenting the number of predicted probabilities that fall 
into each bin. There are always only two classes in the histograms: the positive 
class (in red) and the negative class (in blue). If it is necessary to use histograms 
for more than two classes, then one of the classes is treated as the positive class, 
and the other classes are combined to form the negative class. For all histograms, 
the x-axis is the probability, and the y-axis is the number of hits for each bin. The 
y-axis ranges from zero to the largest number of hits in a single bin of the 
histogram. The number of bins is always set to 10, and thus the bin width is 0.1. 
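
Both tools can be sketched in a few lines of Python. The snippet assumes a confusion matrix whose rows are true classes and whose columns are predicted classes (the orientation is an assumption of this sketch), and it uses made-up numbers; the function names are hypothetical.

```python
def precision_recall(confusion, k):
    """Precision and recall (TPR) of class k.

    confusion[i][j] counts examples of true class i predicted as class j.
    """
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                   # row k without the diagonal
    fp = sum(row[k] for row in confusion) - tp    # column k without the diagonal
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def histogram(probabilities, bins=10):
    """Count probabilities in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for p in probabilities:
        index = min(int(p * bins), bins - 1)      # p == 1.0 goes to the last bin
        counts[index] += 1
    return counts

# Made-up two-class example (rows: true class, columns: predicted class).
cm = [[8, 2],
      [1, 9]]
print(precision_recall(cm, 0))
print(histogram([0.05, 0.93, 0.97, 1.0, 0.55]))
```

With 10 bins, two probabilities such as 0.93 and 0.97 fall into the same bin, which is why exact values are needed when a threshold has to be chosen inside a bin.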


5.5 Hybrid of Hierarchical and Cascade Model 


In order to improve the performance of the classification algorithm compared to the 
single CNN classifier, we can use more than one classifier and divide the data set 
into suitable groups, as is done in Sect. 4.2. Eight classifiers have been 
trained on the grouped data sets. These classifiers form a hierarchy that is applied to 
classify the original data set. This architecture may also be seen as a hybrid between 
hierarchical and cascade models. The architecture is depicted in Fig. 8. The level of 
each classifier in the hierarchy, the data set (the groups) that it has been trained 
on, and the number of classes for each classifier are given in Table 8. 

The first idea was to use merely a cascade model of classifiers, so that the commonest 
species, determined by the distribution of the data set, would be filtered out 
(classified) by the first classifier. The second commonest species would be filtered 
out by the second classifier, and so forth. However, the early tests showed that there 
is no significant difference in performance if the classes with better separability 
are classified by a single classifier. Moreover, the cascaded approach would have led 
to a relatively large number of classifiers to be trained, thus increasing the 
training time. Nevertheless, the cascaded model was applied to some groups, 
especially to the groups with the weakest separability, which are the gulls-and-terns 
groups. The prediction of a classifier for a test image is the vector output of the 
softmax layer, and it is given by 


$$P = [p_1, p_2, \ldots, p_n], \qquad (1)$$ 


where $p_i$ is the probability of class $i$ as a result of the classification, and $n$ is 
the number of classes. Classes are alphabetically ordered by their class labels. 
Thresholds are applied as follows: 


$$c_i = \begin{cases} 1, & \text{if } p_i > \mathit{threshold}_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$ 

$$C = [c_1, c_2, \ldots, c_n], \qquad (3)$$ 


where $p_i$ is as in the P-vector (1), and $\mathit{threshold}_i$ is the threshold for 
class $i$. As a result of Eqs. (2) and (3) there will be exactly one element, $c_i$, turned 
to one in the C-vector, and the rest of the elements are turned to zeros. The class 
label is found according to the index of the element that is turned to one: 


$$j = \operatorname*{arg\,max}_{i}(C), \qquad (4)$$ 


where j is the index of the predicted class. 


Table 8 Classifiers trained for the hybrid model 
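
Equations (1)-(4) can be illustrated with a short Python sketch. The probabilities, thresholds, and labels below are hypothetical, and returning `None` when no element exceeds its threshold is an assumption of this sketch that the equations themselves do not cover.

```python
def predict_index(p, thresholds):
    """Apply Eqs. (1)-(4): threshold the softmax vector and return the index.

    Returns None when no element exceeds its threshold; this fallback is an
    assumption of the sketch, not part of the equations.
    """
    c = [1 if p_i > t_i else 0 for p_i, t_i in zip(p, thresholds)]  # Eqs. (2), (3)
    if sum(c) == 0:
        return None
    return max(range(len(c)), key=lambda i: c[i])                   # Eq. (4)

labels = ["A", "B", "C"]              # hypothetical, alphabetically ordered labels
p = [0.05, 0.90, 0.05]                # hypothetical softmax output
thresholds = [0.70, 0.60, 0.80]       # hypothetical per-class thresholds
j = predict_index(p, thresholds)
print(labels[j])
```

In the algorithms of this chapter the case in which no threshold is exceeded is handled by returning a pseudo-class instead of `None`.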


5.6 Top-Level Classification 


The top-level classifier is the most important in terms of TPR, because a 
misclassification at this level propagates to the subsequent levels of the hierarchy. 
This classifier deals with the groups swans, waterfowl-1, white-tailed eagle, and 
gulls-and-terns-1. Class imbalance ratios of the top-level groups, rounded to the 
nearest integer, are given in Table 9. Considering the environmental license 
requirements, it is crucial to keep the number of false negatives (FN, a data example 
from the positive class that is misclassified as the negative class) for the group 
white-tailed eagle and the group gulls-and-terns-1 as low as possible, preferably at 
zero. Figures 9 and 10 illustrate the choice of a possible threshold for the group 
white-tailed eagle and for the group gulls-and-terns-1, respectively. These histograms 
are formed from the predictions of the top-level classifier (a.k.a. primary 
classifier), so that the histograms for the positive class and the negative class are 
plotted in the same graph. Equivalent ROC curves are computed based on the histograms, 
from which the TPRs and false positive rates (FPR) are calculated. ROC curves for the 
group white-tailed eagle and for the group gulls-and-terns-1 are shown in Figs. 11 and 
12, respectively. Both figures, the histogram and the ROC curve, for the group 
white-tailed eagle show that this group is clearly separable, and thus it is easy to 
choose a suitable threshold for perfect classification of the group. Generally, two 
probability values can be read from the histogram and used as a threshold: the lowest 
probability value of the positive class (LPPC), and the highest probability value of 
the negative class (HPNC). 


Table 9 Imbalance ratios of the top-level groups to the group with the largest number of images (gulls-and-terns-1) 

Class label           Ratio 
White-tailed eagle    1:5 
Swans                 1:51 
Waterfowl-1           1:2 

Fig. 9 Histogram for the group white-tailed eagle 
Fig. 10 Histogram for the group gulls-and-terns-1 
Fig. 11 ROC curve for the group white-tailed eagle 
Fig. 12 ROC curve for the group gulls-and-terns-1 


In the case of the group white-tailed eagle, these two probability values do not 
overlap, and thus this class is clearly separable. The threshold can be set anywhere 
between 0.8 and 0.9 in order to classify this class perfectly. All true positives (TP, 
a data example from the positive class that is correctly classified) will be classified 
correctly and there will be no false positives (FP, a data example from the negative 
class that is misclassified as the positive class), nor FNs. As the group white-tailed 
eagle actually consists of only that single bird species, this also means that 
this classifier is capable of classifying the white-tailed eagle in accordance with the 
environmental license. 

In the case of the group gulls-and-terns-1, the LPPC and the HPNC overlap. 
There are two data examples from the negative class that have a probability 
between 0.9 and 1, and all of the probabilities from the positive class fall into the 
same bin. Probabilities inside the bins cannot be read from the histograms, but the 
plotting software (MATLAB) also prints the exact values of the probabilities. The LPPC 
and the HPNC for the group gulls-and-terns-1 are 0.9000 and 0.9643, respectively. If 
we choose 0.9 for the threshold, there will be two FPs, but if we choose 0.9643 for 
the threshold, there will be no FPs, nor FNs. In this case the number of FNs is the 
most important, because the lesser black-backed gull belongs to the group 
gulls-and-terns-1, and it is particularly taken into account in the environmental 
license, so we cannot take the risk of misclassifying a gull at the top level of the 
hierarchy. Therefore, we must choose 0.9643 for the threshold. Table 10 shows the 
applied thresholds for the top-level groups. The threshold for the group white-tailed 
eagle is set to 0.7415, because this is the exact value printed by the plotting 
software. One image of the great cormorant in the test data set is misclassified as 
the white-tailed eagle, and this causes the number of FPs to be one for the 
white-tailed eagle and the number of FNs to be one for the group waterfowl-1. This is 
an acceptable error rate, because no white-tailed eagles are missed. Algorithm 1 
describes the top-level classification process. This algorithm also defines a new 
pseudo-class, which means that this class does not exist in the data set, but it is 
used when the primary classifier fails to classify a test image correctly. Thus, it 
enables the definition of an unidentified bird (UNBI) class without explicitly 
including it in the real-world classes. 
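
How the LPPC and the HPNC relate to separability and to the error counts at a candidate threshold can be sketched as follows. The probabilities are made-up values, not the predictions reported here, the function names are hypothetical, and the strict comparison follows Eq. (2).

```python
def lppc_hpnc(pos_probs, neg_probs):
    """Lowest positive-class probability and highest negative-class probability."""
    return min(pos_probs), max(neg_probs)

def errors_at(threshold, pos_probs, neg_probs):
    """Count (FP, FN) when 'probability > threshold' means the positive class."""
    fp = sum(1 for p in neg_probs if p > threshold)
    fn = sum(1 for p in pos_probs if p <= threshold)
    return fp, fn

# Made-up predicted probabilities of the positive class, for illustration only.
positives = [0.99, 0.97, 0.95, 0.92]
negatives = [0.02, 0.10, 0.35, 0.60]
lppc, hpnc = lppc_hpnc(positives, negatives)
separable = hpnc < lppc          # no overlap between the two distributions
print(lppc, hpnc, separable, errors_at(0.8, positives, negatives))
```

When the two values do not overlap, any threshold between them gives zero FPs and zero FNs on the test data; when they overlap, every threshold trades one kind of error for the other.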


Algorithm 1: Classification on the top level. 
Data: A test image 
Result: Classification of the test image 
image = zeroCentering(testImage); 
TopLevelPrediction = classify(PrimaryClassifier, image); 
if TopLevelPrediction > thresholdWhiteTailedEagle then 
    return WTEA; // the white-tailed eagle 
else if TopLevelPrediction > thresholdSwans then 
    return SWSP; // swan species 
else if TopLevelPrediction > thresholdGullsAndTerns then 
    // Classification for gull and tern species continues in Algorithm 3 
else if TopLevelPrediction > thresholdWaterfowl then 
    // Classification for waterfowl species continues in Algorithm 2 
else 
    return UNBI; // a pseudo-class for unidentified bird species 
end 
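
A minimal Python sketch of the top-level decision in Algorithm 1 is given below. It assumes that each comparison in the algorithm refers to the softmax probability of the corresponding group, and it represents the prediction as a dictionary; both the data layout and the function name are assumptions of this sketch. Zero-centering and the CNN itself are left out, and the thresholds are the values quoted in the text and in Table 10.

```python
# Thresholds quoted in the text and in Table 10.
THRESHOLDS = {
    "white_tailed_eagle": 0.7415,
    "swans": 0.7000,
    "gulls_and_terns_1": 0.9643,
    "waterfowl_1": 0.0083,
}

def classify_top_level(prediction):
    """Map the top-level softmax output to a label or a next-stage name.

    `prediction` maps group name -> probability (an assumed data layout).
    The groups are tested in the order used by Algorithm 1.
    """
    if prediction["white_tailed_eagle"] > THRESHOLDS["white_tailed_eagle"]:
        return "WTEA"
    if prediction["swans"] > THRESHOLDS["swans"]:
        return "SWSP"
    if prediction["gulls_and_terns_1"] > THRESHOLDS["gulls_and_terns_1"]:
        return "continue: gulls-and-terns"   # handled by Algorithm 3
    if prediction["waterfowl_1"] > THRESHOLDS["waterfowl_1"]:
        return "continue: waterfowl"         # handled by Algorithm 2
    return "UNBI"                            # pseudo-class: unidentified bird

example = {"white_tailed_eagle": 0.01, "swans": 0.01,
           "gulls_and_terns_1": 0.97, "waterfowl_1": 0.01}
print(classify_top_level(example))
```

Because the branches are tested in a fixed order, the order itself acts as a priority among the groups whenever more than one threshold is exceeded.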


Table 10 Thresholds for the top-level group 

Group                 Threshold 
White-tailed eagle    0.7415 
Swans                 0.7000 
Waterfowl-1           0.0083 
Gulls-and-terns-1 





252 J. Niemi and J. T. Tanttu 


5.7 Classification of Waterfowl 


Waterfowl are classified at the second level of the hierarchy by two cascaded 
classifiers. The first one filters out the commonest class, GRCO, and all the other 
waterfowl are classified by the second classifier. Thresholds for the first classifier 
are given in Table 11. There is one FN, and accordingly one FP, when these thresholds 
are applied. The misclassified class is GRCO, which is the only class in the group of 
cormorants. Thresholds for the group waterfowl-2 are given in Table 12. All other 
classes have one FN each, except for the class LOSP, which is clearly separable. 
Algorithm 2 shows the classification process for both waterfowl groups. Two 
new pseudo-classes are defined in the algorithm: unidentified waterfowl (UNWF), 
and unidentified small waterfowl (UNSW). 


Algorithm 2: Classification for the waterfowl groups. 
Data: Image from the top level classifier in Algorithm 1, when TopLevelPrediction 
      > thresholdWaterfowl 
Result: Classification of the test image 
predictionWF1 = classify(classifier_3.1, image); 
if predictionWF1 > thresholdCormorant then 
    return GRCO; // the great cormorant 
else if predictionWF1 > thresholdWF2 then 
    predictionWF2 = classify(classifier_3.2, image); 
    if predictionWF2 > thresholdLoons then 
        return LOSP; // loon species 
    else if predictionWF2 > thresholdGoldeneye then 
        return COGO; // the common goldeneye 
    else if predictionWF2 > thresholdEider then 
        return COEI; // the common eider 
    else if predictionWF2 > thresholdMerganser then 
        return RBME; // the red-breasted merganser 
    else if predictionWF2 > thresholdScoter then 
        return VESC; // the velvet scoter 
    else 
        return UNSW; // a pseudo-class for small unidentified waterfowl 
    end 
else 
    return UNWF; // a pseudo-class for unidentified waterfowl 
end 


Table 11 Thresholds for the group waterfowl-1 

Class label    Threshold 
Cormorant      0.2627 
Waterfowl-2    0.7000 


Table 12 Thresholds for the group waterfowl-2 

Class label 
LOSP 
COEI 
COGO 
VESC 
RBME 




Table 13 Number of images in the larger test data sets for gulls and terns 

Pair of groups                            # Positive class    # Negative class 
{gray-backed-gulls, gulls-and-terns-2} 
{blackheaded-tern, black-backed-gulls}    108 
{HEGU, COGU}                              89 
{BHGU, CATE} 
{LBBG, GBBG} 


5.8 Classification of Gulls and Terns 


Gulls and terns are classified at the second and third levels of the hierarchy by 
cascaded classifiers. We have used a larger test data set with more images for the 
groups of gulls and terns, which is possible because the scarcest classes in the 
original data set are not included in these groups. In this way, we obtain more robust 
thresholds, while the original distribution is retained and the test data set still 
contains only images that the classifiers have never seen before. The numbers of 
images in these data sets are given in Table 13. In this table, the pair of groups or 
classes is in the first column from the left, with the positive class mentioned first. 
The following two columns are the numbers of images of the positive class and the 
negative class, respectively. We can calculate from the table that most of the pairs 
at the fourth level of the hierarchy (the species level) are almost balanced. The 
pair {LBBG, GBBG} is the only significant exception, having a class imbalance 
ratio of 1:3, and because this pair also has the weakest separability, a poor 
classification result is expected in terms of TPR. 

At the second level the commonest group, gray-backed-gulls, is filtered out first 
from the group gulls-and-terns-2, and then the group blackheaded-tern. Finally, the 
group black-backed-gulls is the only one left. Figure 13 shows the histogram for the 
group gray-backed-gulls. It becomes clear from the histogram that the distributions 
of the positive class (gray-backed-gulls) and the negative class (gulls-and-terns-2) 
overlap, and given that, we must choose a suitable threshold while keeping in mind the 
terms of the environmental license. There are two choices for the threshold: 0.6000 
with one FP, and 0.7590 with two FNs. We must choose 0.7590 even though it means 
weaker general performance for the hierarchical classifier. This is because we do not 
want any member of the class LBBG misclassified at the second level. As a result of 
applying this threshold, two images of the group gray-backed-gulls will be 
misclassified as gulls-and-terns-2. 

Fig. 13 Histogram for the group gray-backed-gulls 

Species-level classification is reached at the fourth (third for gray-backed gulls) 
level of the hierarchical classifiers. This includes the pairs of classes with the 
weakest separability: {HEGU, COGU} and {LBBG, GBBG}. The overlap of the distributions 
of the positive and the negative classes is illustrated in Figs. 14 and 15. The 
class HEGU is the positive class in Fig. 14, and the class COGU is the negative 
class. The class LBBG is the positive class in Fig. 15, and the class GBBG is the 
negative class. The best value for a threshold for separating HEGU from COGU, 
in terms of classifier performance, is 0.2134. This means zero FPs but eleven FNs, 
i.e., eleven images of herring gulls will be misclassified as common gulls. 
Classification of the pair {BHGU, CATE} is straightforward, because with the chosen 
threshold the numbers of FNs and FPs are both zero. See Table 14 for the thresholds 
for the groups of gulls and terns. The best option for a threshold of the class LBBG 
is 0.9993, for which the number of FNs is 12. Algorithm 3 shows the classification 
process for the group gulls-and-terns-1. Five new pseudo-classes are defined in this 
algorithm: gray-backed gull (GBGU, either the herring gull or the common gull), 
black-headed (BHTE, either the black-headed gull or tern species), black-backed gull 
(BBGU, either the lesser black-backed or the great black-backed gull), 
non-gray-backed gull (NGGU, either BHTE or BBGU), and unidentified gull 
or tern (UNGU). 

Fig. 14 Histogram for the class HEGU (herring gull) 
Fig. 15 Histogram for the class LBBG (lesser black-backed gull) 


Table 14 Thresholds applied to the pair of groups of gull and tern species 

Class label                               # FPs 
{gray-backed-gulls, gulls-and-terns-2}    0 
{blackheaded-tern, black-backed-gulls}    0 
{HEGU, COGU}                              0 
{BHGU, CATE}                              0 
{LBBG, GBBG}                              0 


6 Results 


The classifiers are compared in terms of their generalization performance. The hybrid 
of the hierarchical and cascaded models achieved an average performance of 0.9460 
(TPR). The reference classifiers have the following average TPRs: 0.8195 for the 
imbalanced reference classifier (IMBRC) and 0.8307 for the balanced reference 
classifier (BRC). The total number of misclassifications for the hybrid model was 16. 
For the reference classifiers this number was 45 for the IMBRC and 71 for the 
BRC. Average precision for the hybrid model was 0.9619. Average precision for 
the IMBRC was 0.8809, and for the BRC 0.7919. TPRs for the top-level groups 
and the class LBBG are given in Table 15. The reference classifiers were trained on 
ungrouped classes; therefore, their TPRs for the groups are averages over the TPRs 
of the classes that the groups consist of. 

The confusion matrix for the top-level groups is given in Table 16. This confusion 
matrix also includes the pseudo-class UNBI. Naturally, the number of TPs is zero 
for pseudo-classes, because these classes are only defined for failures of the 
classifiers. The confusion matrix for the classes is given in two parts, because it is 
too big to fit on the page. Table 17 presents the first part of the confusion matrix 
including the group waterfowl-1, i.e., the classes LOSP, GRCO, COEI, COGO, VESC, and 
RBME. This confusion matrix also includes the pseudo-classes UNSW and UNWF. One 
test image of GRCO is presented in the top-level confusion matrix; therefore, the 
number of test images for the class GRCO is 99 in the waterfowl confusion matrix. 
Table 18 presents the second part of the confusion matrix including the group 
gulls-and-terns-1 (the classes GBBG, HEGU, LBBG, COGU, BHGU, and CATE). The 
five pseudo-classes defined in Algorithm 3 for gulls and terns are omitted in order to 
save space, and because no image of any of the subgroups of gulls-and-terns-1 
was misclassified as any of these pseudo-classes. 


Table 15 TPRs for the hybrid model and the reference classifiers 


Imbalanced reference 0.6923 
Balanced reference 0.8846 





Table 16 Confusion matrix for the top-level groups of the hierarchy 


Algorithm 3: Classification for the group gulls-and-terns. 
Data: Image from the top level classifier in Algorithm 1, when TopLevelPrediction 
      > thresholdGullsAndTerns 
Result: Classification of the test image 
predictionGullsTerns = classify(classifier_4.1, image); 
if predictionGullsTerns > thresholdGrayGulls then 
    predictionHerringCommon = classify(classifier_4.1.1, image); 
    if predictionHerringCommon > thresholdHerring then 
        return HEGU; // the herring gull 
    else if predictionHerringCommon > thresholdCommon then 
        return COGU; // the common gull 
    else 
        return GBGU; // a pseudo-class for a 'gray-backed gull', i.e., 
                     // either the herring gull or the common gull 
    end 
else if predictionGullsTerns > thresholdGullsTerns2 then 
    predictionGullsTerns2 = classify(classifier_4.2, image); 
    if predictionGullsTerns2 > thresholdBlackHeaded then 
        predictionBlackHeaded = classify(classifier_4.2.1, image); 
        if predictionBlackHeaded > thresholdBHGU then 
            return BHGU; // the black-headed gull 
        else if predictionBlackHeaded > thresholdCATE then 
            return CATE; // the common/arctic tern 
        else 
            return BHTE; // a pseudo-class for either the black-headed 
                         // gull or tern species 
        end 
    else if predictionGullsTerns2 > thresholdBlackBacked then 
        predictionBlackBacked = classify(classifier_4.3, image); 
        if predictionBlackBacked > thresholdLBBG then 
            return LBBG; // the lesser black-backed gull 
        else if predictionBlackBacked > thresholdGBBG then 
            return GBBG; // the great black-backed gull 
        else 
            return BBGU; // a pseudo-class for either the lesser 
                         // black-backed or the great black-backed gull 
        end 
    else 
        return NGGU; // a pseudo-class for 'non-gray-backed gull' 
    end 
else 
    return UNGU; // a pseudo-class for unidentified gull or tern species 
end 
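
The nested structure of Algorithms 2 and 3 can also be expressed in a data-driven way. The following Python sketch is one possible formulation and not the implementation used in this work: the node layout, the function name `run_cascade`, and the stub classifiers with fixed outputs are all hypothetical. The threshold 0.9993 for LBBG is taken from the text, while the other two thresholds are placeholders.

```python
def run_cascade(node, image):
    """Walk a hierarchy of thresholded classifiers and return a class label.

    A node is a dict with:
      "classify": callable(image) -> {branch name: probability}
      "branches": ordered list of (branch name, threshold, label or child node)
      "fallback": pseudo-class returned when no threshold is exceeded
    """
    probabilities = node["classify"](image)
    for name, threshold, target in node["branches"]:
        if probabilities[name] > threshold:
            if isinstance(target, dict):      # descend to the next classifier
                return run_cascade(target, image)
            return target                     # leaf: a species label
    return node["fallback"]

# Stub classifiers with fixed outputs stand in for trained CNNs.
black_backed = {
    "classify": lambda image: {"LBBG": 0.40, "GBBG": 0.30},
    "branches": [("LBBG", 0.9993, "LBBG"), ("GBBG", 0.9, "GBBG")],  # 0.9 is a placeholder
    "fallback": "BBGU",
}
gulls_terns_2 = {
    "classify": lambda image: {"black_backed": 0.95},
    "branches": [("black_backed", 0.7, black_backed)],             # 0.7 is a placeholder
    "fallback": "NGGU",
}
print(run_cascade(gulls_terns_2, image=None))
```

In this formulation each pseudo-class is simply the fallback of one node, so an image that cannot be resolved at the species level is still reported at the most specific group level that was reached.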


As the number of images in the test data set is 439, we must split this number 
between the three confusion matrices. The first confusion matrix is for all 439 images, 
but because the classes WTEA and SWSP are only presented in this confusion matrix, 
the sum of the number of test images in the other two confusion matrices is 
439 - 50 = 389. The second confusion matrix presents results for 153 test images and 
the third for 236 test images, so that the sum is 153 + 236 = 389 test images. 

The confusion matrix for the IMBRC is given in two tables, Tables 19 and 20. 
The class WTEA is included in both tables, because there are FPs and/or FNs 
for it in the two tables. However, the number of TPs for the class WTEA is only 
given in the first table. The same test data set as with the hybrid model was 
used when the reference classifiers were tested. The total number of images, 439, is 
again divided into two tables. The first table covers 197 test images and the second 
table covers 242 test images. 


Table 17 Confusion matrix for the classes of the group waterfowl-1 

Table 18 Confusion matrix for the classes of the group gulls-and-terns-1 

Table 19 Confusion matrix for the imbalanced reference classifier, part one 

The confusion matrix for the BRC is also given in two tables, Tables 21 and 22. 
There are no FPs or FNs for the class WTEA in the second confusion 
matrix, so the class can be omitted from that table. 


Table 20 Confusion matrix for the imbalanced reference classifier, part two 

Table 21 Confusion matrix for the balanced reference classifier, part one 

Table 22 Confusion matrix for the balanced reference classifier, part two 


The results for the modified CNN models compared to the original CNN model 
are given in Table 23. All three models were trained on the same augmented data 
set, which consisted only of the images from the group of black-backed gulls. These 
models were tested as single classifiers. The table contains TPRs for both training 
and generalization (tested on images that the classifier has never seen before). The 
first modified model had four convolution layers, and the second had five convolution 
layers. The models were tested only on the group of black-backed gulls in these tests. 


Table 23 TPRs for the modified CNN models compared to the original CNN model 


7 Discussion 


The tests showed that the hybrid model performs significantly better than the 
reference classifiers. The only problematic class, in terms of the 
environmental license, is the LBBG. Even though its number of FNs was zero 
in the test for the hybrid model, the number of FNs was 12 in the test for the gulls 
and terns only. The number of test images in the latter test was larger, and this gives 
insight into a real-world implementation. The number of possible FPs is not 
significant in this context, because it would just mean that other gull species, most 
likely great black-backed gulls, are misclassified as LBBG. Therefore, it is advisable 
to combine the classes LBBG and GBBG into a single class, i.e., not to classify the 
group black-backed-gulls any further. 

The BRC performed better than the IMBRC in terms of TPR. However, the 
number of misclassifications is 71 for the BRC and 45 for the IMBRC. The difference 
is explained by the better average precision of the IMBRC. Precision increases as the 
number of FPs decreases, and TPR increases as the number of FNs decreases. Because 
the environmental license requires the number of FNs to be kept low, TPR is a more 
significant metric than precision in our context, and thus the BRC would be the 
second choice after the hybrid model. The IMBRC showed poor performance even though 
it was trained on a larger data set than the BRC ($6.45 \times 10^{6}$ versus 
$8.66 \times 10^{5}$). This implies that the straightforward use of a single classifier 
on an imbalanced data set gives poor performance in terms of TPR. This result is 
based on a relatively low number of data examples, which is often the case in 
real-world applications, but this method could perform better when trained on a 
significantly larger training data set. However, if precision is an important 
criterion, then this method may be considered for real-world usage. 

The top-level group has one FP in its confusion matrix (Table 16). This FP is a 
GRCO misclassified as WTEA, which is, of course, a FN for the class GRCO. However, 
this is acceptable, as no WTEA is misclassified, and thus the number of FNs for the 
class WTEA is zero. The group waterfowl-1 also shows good results, with only five 
misclassifications. It seems that grouping the original classes is a useful approach 
to this kind of real-world classification problem. By grouping, one can confine the 
most difficult classification problem to one group or even just to one subgroup. 
This approach indicates where the challenge lies. In this context the challenge lies 
in the groups gray-backed-gulls and black-backed-gulls. The bird species that these 
groups consist of are very similar in terms of morphology. This leads to the 
conclusion (assessed by the human eye) that the overlapping area at the classification 
boundary is wide for both groups. This suggests that a significant increase in 
classification performance can be achieved only by collecting more images of these 
groups. 

Surprisingly, the modified CNN models, with architectures of more than three 
convolution layers, did not perform better than the original CNN model. This implies 
that the original model, with its architecture of three convolution layers, is capable 
of extracting all relevant features from the training images, and that additional 
convolution layers cannot add any more information. 

The measured performance of the image classification has been obtained without 
using the parameters supplied by the radar. Those parameters, especially the speed 
of an object, provide additional and relevant a priori knowledge to the system. The 
radar system has measured significant differences in flight speed between the groups 
waterfowl-1 and gulls-and-terns-1, and this can be utilized to turn a 
misclassification into a correct classification. 


Deep Learning for Person Re-identification in Surveillance Videos 


Swathi Jamjala Narayanan, Boominathan Perumal, Sangeetha Saman 
and Aditya Pratap Singh 


Abstract In recent years, closed-circuit television (CCTV) has come to be viewed as 
the basis for providing security. One of the most important aspects of the security 
mechanism of CCTV surveillance systems is to re-identify a person captured by one 
camera across different surveillance cameras. Re-identification has a major role in 
several applications such as automated surveillance of universities, offices, malls, 
homes, and restricted environments like embassies or laboratories with strong security 
restrictions. Traditionally, identifying a person in a video was practiced under the 
same set of external conditions (same illumination, viewpoint, background conditions, 
etc.). But when it comes to automated re-identification in a CCTV surveillance system, 
several challenges emerge, as the environment is uncontrolled and keeps varying; 
furthermore, the poses of the person and the angles of the cameras capturing the 
videos pose additional challenges for the task. When a person disappears from 
one camera view for a period of time, he or she should be recognized in another camera 
view at a different location despite environmental disturbances such as variation 
in illumination, crowded scenes, partial occlusions, physical appearance variations, 
full occlusions, viewpoint variations, background clutter, shadows, and reflections. 
In this chapter, the major focus is on the deep learning techniques used to 
develop an end-to-end re-identification system, highlighting the methods that handle 
the challenges of the uncontrolled environment mentioned above. An end-to-end 
re-identification task consists of a sequence of steps, namely pedestrian detection 
and person tracking, followed by person re-identification. Given a video sequence or 
an image as input, the humans are first detected in the video sequence by pedestrian 
detection. Person tracking within the camera is then conducted to find the different 
poses of the probe if needed. Then the re-identification process is conducted, in 
which deep learning models are used to re-identify the person with the help of a 
gallery set of videos, and the similarities between the gallery set and the person of 
interest are evaluated by using deep learning metrics. The re-identification process 
ends as a retrieval process in which all similar images of the person of interest are 
retrieved. Several benchmark datasets considered in the literature for 
re-identification systems are VIPeR, ETHZ, PRID, CAVIAR, CUHK01, CUHK02, CUHK03, 
i-LIDS, RAiD, MARS, etc. 


Keywords Deep learning · Re-identification · Video surveillance


1 Introduction 


The most important aspect of any intelligent closed-circuit television (CCTV) surveillance
system is to accomplish the task of re-identifying humans, which is popularly termed
Person Re-identification [1, 2]. The objective of such a system is to find out whether a
person showing up in one camera appears again in another camera, i.e., to determine
whether a pair of humans appearing in various cameras with non-overlapping views [3] has
the same identity or different identities. Hiring human operators to track the person of
interest is highly time-consuming, and most of the time it ends up as an exhausting task
for the operators. To overcome this, an automated computer vision system with little
human intervention is more suitable for assisting human operators in identifying a person
over a set of non-overlapping or disjoint cameras. Research in this area is driven by the
increasing demand for public safety, supported by the widespread large camera networks
placed in and around public places like theme parks, shopping malls, universities,
airports, etc. It is very costly to depend entirely on human workers to accurately
recognize or track a person of interest across several cameras.

In the early days, person re-identification was considered a multi-camera tracking
problem in which appearance-based models were used together with the geometric
calibration of disjoint cameras in the surrounding environment. In 2005, the term
re-identification was coined by Zajdel et al. [4] from the University of Amsterdam, who
tried to re-identify a person when he leaves the camera view and appears once again. In
2006, Gheissari et al. [5] applied a spatio-temporal segmentation algorithm and used the
visual cues of the persons as input for foreground detection. The problem was solved as
image-based Re-id rather than as a video-based one. This was the first work to isolate
person Re-id from multi-camera tracking. Since then, Re-id has been considered an
independent computer vision task. In 2010, two major works showed that using multiple
frames per person effectively improves over the single-frame version [6]. In 2014,
Yi et al. [7] and Li et al. [8] successfully employed a Siamese neural network for the
person re-identification problem, in which a pair of input images was correctly
determined to have the same identity. This network helped to overcome a major issue of
the Re-id problem, namely the lack of training samples. Also in 2014, Xu et al. proposed
an end-to-end image-based re-identification model [9] in which they combined detection
and re-identification scores. Detection was used to find the commonness and
re-identification to find the uniqueness of the persons. The problem of person
re-identification can be solved using either of two kinds of systems, namely handcrafted
systems and deep learning systems. A re-identification system includes two components,
namely pedestrian detection and distance metrics. In handcrafted systems, the features
are extracted and passed on to the re-identification system, whereas in deep learning
systems, feature learning is inherent in the architecture and provides improved results
compared with handcrafted systems. This chapter covers the preliminaries of deep
learning algorithms, followed by re-identification datasets and the different
architectures, activation functions, loss functions and evaluation protocols used in
re-identification applications.


2 Preliminaries of Deep Neural Networks 


This section briefly discusses the basic deep learning models used in computer vision
tasks. The models discussed are the Convolutional Neural Network, LeNet-5, AlexNet,
ZFNet, VGGNet, GoogLeNet, ResNet, the Recurrent Neural Network and the Siamese Neural
Network. These networks use the CNN as a basic model, and they vary in their
architecture with respect to the number of hidden layers, activation functions, loss
functions and training mechanism.


Convolutional Neural Network 

In the domain of deep learning, most of the work on Convolutional Neural Networks
(CNN) has been devoted to analyzing visual images [10]. The network reduces the need for
a preprocessing step because it learns the features automatically using filters, which
avoids a manual feature design process. A convolutional neural network is composed of an
input layer, an output layer and multiple hidden layers. The hidden layers consist of
convolution layers, activation functions, pooling layers, fully connected layers and
normalization layers. The convolution layer applies the convolution operation to its
input and forwards the result to the subsequent layer, where each neuron handles the
data only for its receptive field. This avoids using a large number of weights and
allows the network to be deeper with fewer parameters. The activation functions commonly
used in CNNs are ReLU, Tanh and Softmax. The pooling layers progressively decrease the
spatial size of the features in order to reduce the number of parameters and
computations in the network. The pooling layer operates on each feature map
independently. The commonly used pooling operations are max pooling and average pooling.
A fully connected layer connects each neuron in one layer to every neuron in the next
layer. The receptive field (input area of the neuron) of each layer varies. Unlike a
fully connected layer, a convolution layer does not take input from every element of the
prior layer; instead, it takes input from a restricted subarea of the previous layer.
Each neuron has weights and a bias, and these weight and bias vectors form a filter
representing some feature of the input. The main strength of the CNN is that many
neurons share the same filter, which removes the need to store separate weight and bias
vectors for each receptive field. The CNN has three distinguishing features. The first
is that it arranges data in a 3D volume with width, height and depth. The second is that
it has layers of different types, connected locally or completely, that are stacked to
build the CNN architecture. The architecture guarantees that the trained filters respond
to a spatially local input pattern, and as more layers are stacked, the filters become
nonlinear and gradually more global. The third is weight sharing, i.e., each filter is
replicated across the entire input of its layer. The basic CNN architecture is given in
Fig. 1.
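
The saving obtained from weight sharing can be illustrated with a small calculation. The
sketch below counts the parameters of a convolution layer and of a fully connected layer
producing an output of the same size. The layer sizes (a 32 × 32 single-channel input
mapped to six 28 × 28 feature maps by 5 × 5 filters) and the function names are
illustrative assumptions, not values taken from Fig. 1.

```python
def conv_params(in_channels, out_channels, kernel):
    """Parameters of a convolution layer: each filter is shared by all positions."""
    weights_per_filter = kernel * kernel * in_channels
    return out_channels * (weights_per_filter + 1)  # +1 for the bias of each filter


def dense_params(in_units, out_units):
    """Parameters of a fully connected layer: every output sees every input."""
    return out_units * (in_units + 1)  # +1 for the bias of each output neuron


# Assumed example: 32 x 32 grayscale input, six 5 x 5 filters, 28 x 28 output maps.
shared = conv_params(in_channels=1, out_channels=6, kernel=5)
unshared = dense_params(in_units=32 * 32, out_units=6 * 28 * 28)
print(shared, unshared)
```

Under these assumptions the convolution layer needs 156 parameters, whereas a fully
connected layer with the same input and output sizes would need 4,821,600.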


LeNet-5 
LeNet-5 is a convolutional network that was mainly intended for handwritten and
machine-printed character recognition [11]. The network has a total of 7 layers: two
convolutional layers, two pooling layers, two fully connected layers and one output
layer. LeNet-5 uses 5 × 5 kernels with stride 1 and 2 × 2 subsampling with stride 2. It
is considered the base model for various other successful deep CNN architectures.
Figure 2 represents the LeNet-5 architecture. Figure 2a shows the architecture with
explicit subsampling or max-pooling layers, whereas Fig. 2b shows the brief
representation in which these layers are only implicit, as is common for other
architectures like AlexNet. In current architectures, max pooling is used in place of
the subsampling layer, and pooling layers occur less frequently than convolution layers.

Fig. 2 a Detailed architectural representation of LeNet-5 [12], b LeNet-5 brief representation [12]

LeNet-5 is very limited by recent standards. The architecture retains the basic
principles, and its most commonly used activation function is the sigmoid. It uses
radial basis function (RBF) units in the final layer, where each unit holds a prototype
that is compared with the input vector, and the output produced is the squared Euclidean
distance between them. In recent practice, RBF units are avoided and softmax units with
log-likelihood loss on multinomial label outputs are used instead. The major application
of LeNet-5 is character recognition, and it has been widely used to read the characters
on bank cheques.
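
The effect of the 5 × 5 kernels with stride 1 and the 2 × 2 subsampling with stride 2 on
the size of the feature maps can be traced with a few lines of arithmetic. The sketch
below assumes a 32 × 32 input and no padding; the function names are hypothetical.

```python
def output_size(size, kernel, stride):
    """Spatial size after an unpadded convolution or subsampling window."""
    return (size - kernel) // stride + 1


def lenet5_feature_sizes(input_size=32):
    """Feature-map side length after each conv/subsampling stage of LeNet-5."""
    stages = [(5, 1), (2, 2), (5, 1), (2, 2)]  # (kernel, stride): conv, pool, conv, pool
    sizes = [input_size]
    for kernel, stride in stages:
        sizes.append(output_size(sizes[-1], kernel, stride))
    return sizes


print(lenet5_feature_sizes())
```

With the assumed 32 × 32 input, the side length shrinks from 32 to 28, 14, 10 and
finally 5 before the fully connected layers.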


AlexNet 

AlexNet is an 8-layer CNN architecture that won the ImageNet challenge in 2012 and
brought widespread popularity to CNN architectures in the area of computer vision [13].
Figure 3 shows the AlexNet architecture.

Fig. 3 The AlexNet architecture without GPU partitioning [12]

In Fig. 3, each convolution layer is followed by a ReLU activation function, which is
not explicitly shown, and the max-pooling layers, denoted MP, follow only a subset of
the convolution-ReLU combinations. The architecture is composed of 5 convolutional
layers and 3 fully connected layers. The first convolution layer comprises 96 filters of
size 11 × 11 at stride 4, and the second convolution layer consists of 256 filters of
size 5 × 5. The third, fourth and fifth convolution layers consist of 384, 384 and 256
filters, respectively, of size 3 × 3 at stride 1. The first, second and final FC layers
consist of 4096, 4096 and 1000 neurons. The most significant characteristics of AlexNet
are its use of the non-linear ReLU activation function and its heavy data augmentation.
The role of the ReLU activation function in increasing the training speed of CNNs was
first demonstrated with the AlexNet architecture, which showed that ReLU trains far
faster than saturating activation functions like sigmoid or tanh. The hyperparameters of
AlexNet (batch size, SGD momentum and learning rate) were set to 128, 0.9 and 0.01.


ZFNet 
The ZFNet architecture, the winner of ILSVRC 2013, is almost identical to AlexNet, as
shown in Fig. 4.

The key differences lie in a few hyperparameter settings and in the filter size and
stride of the first layer. The filter size of 11 × 11 was reduced to 7 × 7, and a stride
of 2 was used instead of stride 4. In convolutional layers 3, 4 and 5, the number of
filters was changed from 384, 384, 256 to 512, 1024, 512, respectively.


VGGNet 

VGGNet [15], presented at ILSVRC 2014, was the runner-up in the competition. Its
best-known configuration has 16 weight layers (13 convolutional and 3 fully connected).
The architecture is interesting because of its uniform architectural style, shown in
Fig. 5. Rather than using large filter sizes with large strides as in AlexNet, VGGNet
uses small 3 × 3 filters with stride 1 throughout the entire network. The model was
trained for two to three weeks using 4 GPUs. This architecture is widely regarded as a
strong model for extracting features from images.

Fig. 5 VGGNet architecture, a, b 16-19 layers VGGNet, c 8 layers AlexNet [15]

It consistently uses 3 × 3 as the filter size and 2 × 2 as the pooling size. The
convolution is performed with stride 1 and padding 1, and the pooling with stride 2. The
spatial size of the output volume is always preserved when a 3 × 3 filter is applied
with a padding of 1, whereas the pooling step always compresses the spatial footprint.
Because the pooling is performed on non-overlapping spatial regions, it always reduces
the spatial footprint by a factor of 2. This architecture is widely used as a feature
extractor in various applications. Its hyperparameter settings are publicly available,
but it is still considered a challenging architecture to handle because it uses 138
million parameters.
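
The size-preserving convolution and the halving pooling step can be checked numerically.
The following sketch assumes a 224 × 224 input and five convolution/pooling blocks, which
are not stated in the text above; the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Side length after a padded convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Side length after non-overlapping pooling."""
    return (size - kernel) // stride + 1


def vgg_spatial_sizes(input_size=224, blocks=5):
    """Side length of the feature maps after each conv + pool block."""
    sizes = [input_size]
    size = input_size
    for _ in range(blocks):
        size = conv_out(size)   # 3 x 3, stride 1, padding 1: size is unchanged
        size = pool_out(size)   # 2 x 2, stride 2: size is halved
        sizes.append(size)
    return sizes


print(vgg_spatial_sizes())
```

Only the pooling layers change the spatial size in this sketch, so the side length is
halved once per block.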
GoogLeNet

The winner of the ILSVRC 2014 competition is the GoogLeNet architecture [16], which
achieved a top-5 error rate of 6.67%. The architecture is inspired by the LeNet
architecture. The novel element of GoogLeNet is the inception module. Techniques such as
image distortions, batch normalization and RMSprop are used in this architecture [17].
The inception module is built from a number of very small convolutions in order to
drastically reduce the number of parameters. The CNN is 22 layers deep, and the number
of parameters is reduced to 4 million from the 60 million of AlexNet. The inception
module is considered the central part of the architecture. Figure 6 depicts an example
of the inception module and the design of a good local network topology in which
inception modules are stacked on top of each other.


ResNet 

ResNet, the winner of ILSVRC 2015 [18], trained a network with 152 layers that
nevertheless has lower complexity than VGGNet. The architecture is distinctive in its
use of “skip connections” and its heavy use of batch normalization. It achieved a top-5
error rate of 3.57%, which was considered better than human-level performance on the
ImageNet data set [19]. The basic unit of this architecture is the residual module, and
the whole network is built by assembling many such residual modules (Fig. 7).
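
The idea of a skip connection can be sketched in a few lines: the input of a module is
added to the output of its transformation before the final activation. The code below is
a minimal illustration on plain Python lists; `transform` stands for the stacked layers
inside the module and is a hypothetical placeholder, not the layer configuration of
ResNet.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_module(x, transform):
    """Compute relu(F(x) + x), where F is the transformation inside the module."""
    fx = transform(x)
    return relu([f + s for f, s in zip(fx, x)])  # skip connection adds the input back


def zero_transform(x):
    # A transformation that has learned nothing: the module then passes relu(x) through.
    return [0.0 for _ in x]


print(residual_module([1.0, -2.0, 3.0], zero_transform))
```

Because the input is carried forward unchanged by the shortcut, the stacked layers only
have to learn a correction to the identity mapping.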


Recurrent Neural Network 

A Recurrent Neural Network (RNN) is commonly used along with a CNN to employ the concept
of recurrence, which basically means using information from a previous forward pass of
the neural network. Figure 8 depicts a simple RNN having a single, self-connected hidden
layer. RNNs are best suited to applications whose input is a sequence. For a given input
sequence, an RNN produces either a sequence of outputs or just one output for the entire
input sequence. The key concept of the RNN [20] is the recurrent connections, which
allow the memory of previous inputs to persist in the network's internal state and thus
influence the network's output. There are several variants of the recurrence
relationship. In the first variant, the hidden state for an entity is computed using its
corresponding input entity and the previous hidden state, and the output of the network
is computed from the hidden state. Activation functions like tanh are used to compute
the hidden state, and softmax functions are used to compute the output of the network.
In the second variant, the hidden state for an entity in the sequence is computed using
its corresponding input entity and the previous output, rather than the previous hidden
state as in the first variant. When the RNN produces a single output, the hidden state
is computed for each entity in the input sequence and the output is computed using the
last hidden state.
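
The first variant described above can be sketched as a forward pass over a sequence. The
code below is a minimal illustration in plain Python, assuming a tanh hidden state and a
softmax output computed from the current hidden state; the weight names `w_xh`, `w_hh`
and `w_hy` are hypothetical, and bias terms are omitted for brevity.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(inputs, w_xh, w_hh, w_hy, h0):
    """Return one softmax output per input entity and the last hidden state."""
    h = h0
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_xh, x), matvec(w_hh, h))]
        h = [math.tanh(p) for p in pre]          # hidden state from input and previous state
        outputs.append(softmax(matvec(w_hy, h)))  # output from the current hidden state
    return outputs, h
```

For an RNN that produces a single output, only the last element of `outputs` would be
used.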

In another variant, the Bidirectional RNN [22], the computation of the hidden state
considers not only the information of the previous entities but also that of the
entities that lie further ahead in the sequence, which is not the case in unidirectional
RNNs. Hence, Bidirectional RNNs have both forward and backward hidden states. The
training of an RNN is generally performed by unrolling the RNN for a given input size
and then training it by computing the gradients and using a
stochastic-gradient-descent-like technique. When the network is unrolled, each of the
transformations between the input, the hidden state (previous and next) and the output
is shallow, i.e., it is represented by a single layer rather than by a deep multilayer
perceptron network.

Fig. 6 Full GoogLeNet architecture (inception modules with dimension reduction stacked on
top of each other to form GoogLeNet) [16]

Fig. 8 A Recurrent neural network [21]

To overcome the problem of vanishing gradients, a variant of the RNN termed Long
Short-Term Memory (LSTM) was proposed. This architecture excels at learning long-range
sequences and avoids the long-term dependency problem [23, 24]. The central idea of the
LSTM model is a novel structure called the memory cell, which consists of four key
components, namely an input gate, a neuron with a self-recurrent connection, a forget
gate and an output gate. The input gate either permits the incoming signal to change the
state of the memory cell or blocks it. The self-recurrent connection is assigned a
weight of 1.0, which guarantees that the state of a memory cell remains stable from one
time step to another. The gates in this model control the interactions between the
memory cell itself and its environment. The forget gate regulates the self-recurrent
connection of the memory cell by allowing the cell to remember or forget its previous
state. Finally, the output gate either allows the state of the memory cell to influence
other neurons or prevents it (Fig. 9).

Fig. 9 Representation of LSTM memory cell [23]

Figure 10 depicts the LSTM memory block with a single cell. The most commonly used gate
activation function ‘f’ is the logistic sigmoid, and hence the gate activations lie
between 0 and 1, where 0 denotes that the gate is closed and 1 that it is open. The
‘tanh’ or logistic sigmoid functions are generally used as the cell input and output
activation functions; in some cases, however, the identity function is also used. The
dashed lines in the figure denote the weighted peephole connections, and the remaining
connections in the block are unweighted, meaning that they have a fixed weight of
1.0 [23]. The LSTM network is similar to a standard RNN, except that the summation units
in the hidden layer are replaced by memory blocks.
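
The interplay of the gates can be sketched for a single scalar memory cell. The code
below is a simplified illustration that uses the common LSTM formulation without the
peephole connections of Fig. 10; the parameter names in the dictionary `p` are
hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell (no peephole connections)."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # cell input
    c = f * c_prev + i * g   # self-recurrent connection, scaled by the forget gate
    h = o * math.tanh(c)     # the output gate controls what leaves the cell
    return h, c
```

With the input gate closed (activation near 0) and the forget gate open (activation near
1), the cell state is carried over unchanged from one time step to the next, which is
how the memory cell preserves information over long sequences.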


Siamese Neural Networks 

A Siamese Neural Network [25] comprises two or more identical sub-networks. Identical
here means that the sub-networks share the same architecture, parameters and weights.
Figure 11 shows a Siamese network with the same weights shared between the sub-networks.
Based on the number of sub-networks used, the architecture can be termed pairwise or
triplet, and the corresponding loss functions are employed accordingly. This network is
appropriate for the person re-identification problem because its output is a similarity
score computed at the top of the network. The network also addresses the data scarcity
problem and achieves a good recognition rate.
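
The weight sharing of a pairwise Siamese network can be sketched as follows: both inputs
pass through the same embedding function with the same weights, and a similarity score
is computed on top. The single tanh layer and the cosine score used here are
illustrative assumptions, not the architecture of Fig. 11.

```python
import math


def embed(x, weights):
    """Shared sub-network: one tanh layer applied with the same weights to any input."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def siamese_score(x1, x2, weights):
    """Similarity score of a pair; both branches use the same weights."""
    return cosine(embed(x1, weights), embed(x2, weights))
```

Since there is only one set of weights, every training pair updates the same
sub-network, which is one reason the approach copes with few samples per identity.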


Activation Functions for Deep Learning Models 

Activation functions are a crucial part of training deep neural networks. An activation
function makes the network powerful enough to learn complex data and to represent
non-linear functional mappings between inputs and outputs. Another important feature of
an activation function is that it should be differentiable so that the backpropagation
optimization strategy can be applied.


Sigmoid 
The sigmoid activation function is an ‘S’-shaped curve that ranges between 0 and 1. It
is defined in Eq. 1.

$$y = \frac{1}{1 + e^{-x}} \qquad (1)$$

The vanishing gradient is a well-known issue of the sigmoid activation function, and it
is more severe in deep architectures. Moreover, the sigmoid activation function is not
zero centered. Despite these issues, sigmoid functions are still widely used in many
classification tasks.


Tanh 
The hyperbolic tangent (Tanh) activation function resolves the zero-centering issue of
the sigmoid function. It ranges between -1 and +1. The activation function is defined in
Eq. 2.

$$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2)$$

Optimization is easier with the tanh activation function because it is zero centered.
The vanishing gradient problem of the sigmoid function still exists in the tanh
function. Tanh is mostly used in LSTMs.
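
The vanishing gradient problem of both functions can be seen from their derivatives,
which shrink towards zero for inputs of large magnitude. The sketch below evaluates the
standard derivatives of the sigmoid and tanh functions; the function names are
hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # never larger than 1.0


for x in (0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x))
```

Because backpropagation multiplies one such factor per layer, a chain of ten sigmoid
layers scales the gradient by at most 0.25 to the power of 10, which is below one
millionth.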


ReLU 

Rectified Linear Units (ReLU) have emerged as a popular activation function in recent
years. ReLU has been reported to converge about six times faster than the Tanh function.
It is defined in Eq. 3.

$$y = \max(0, x) \qquad (3)$$

ReLU [13] is very simple and efficient and avoids the vanishing gradient problem, so it
is widely used in very deep architectures. However, the ReLU activation function suffers
from the dying ReLU problem, where a large gradient flowing through a ReLU neuron may
update the weights in such a way that the neuron is never activated again on any data
point. Its use is limited to the hidden layers of deep architectures.


Leaky ReLU 
Leaky ReLU is a solution to the “dying ReLU” problem. Rather than being zero when x < 0,
a leaky ReLU has a small slope for negative inputs. It is defined in Eq. 4.

$$y = \begin{cases} \alpha x, & x < 0 \\ x, & x \ge 0 \end{cases} \qquad (4)$$

The value of α in Leaky ReLU is 0.01. Though leaky ReLU provides good results in a few
cases, it does not do so consistently.


Parametric ReLU 

Parametric ReLU (PReLU) adaptively learns the parameters of the rectifiers [18] and
improves accuracy at negligible extra computational cost. The difference between
parametric and leaky ReLU is that leaky ReLU uses a predetermined value of α, whereas
parametric ReLU learns the value of α during training of the neural network. PReLU is
defined in Eq. 5.

$$y = \begin{cases} \alpha x, & x < 0 \\ x, & x \ge 0 \end{cases} \qquad (5)$$


Maxout 
ReLU and its leaky version are both generalized by the Maxout neuron [26] activation
function, which has twice the number of parameters. The activation function is defined
in Eq. 6.

$$y = \max\left(W_1^{T} x + b_1,\; W_2^{T} x + b_2\right) \qquad (6)$$

where $W_1$, $W_2$ are weight parameters and $b_1$, $b_2$ are biases.
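
That Maxout generalizes ReLU and leaky ReLU follows directly from Eq. 6 by fixing one of
the two linear pieces. The sketch below shows this for a scalar input; the function
name `maxout` and the chosen weights are illustrative assumptions.

```python
def maxout(x, w1, b1, w2, b2):
    """Maximum of two linear functions of the input vector x."""
    z1 = sum(w * v for w, v in zip(w1, x)) + b1
    z2 = sum(w * v for w, v in zip(w2, x)) + b2
    return max(z1, z2)


def relu_via_maxout(x):
    # First piece fixed to zero, second piece is the identity: max(0, x).
    return maxout([x], [0.0], 0.0, [1.0], 0.0)


def leaky_relu_via_maxout(x, alpha=0.01):
    # First piece has slope alpha, second piece is the identity: max(alpha * x, x).
    return maxout([x], [alpha], 0.0, [1.0], 0.0)
```

In a trained Maxout unit both linear pieces are learned, which is the source of the
doubled parameter count.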


ELU 
The Exponential Linear Unit (ELU) [27] is closely related to leaky ReLU. The function
has a small slope for negative values, and it uses an exponential curve instead of a
straight line. The advantages of ReLU and leaky ReLU are combined in ELU. However, for
large negative values, it saturates and basically remains inactive. The function is
defined in Eq. 7.

$$y = \begin{cases} x, & \text{if } x > 0 \\ \alpha\left(e^{x} - 1\right), & \text{if } x \le 0 \end{cases} \qquad (7)$$
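
The rectifier family of Eqs. 3, 4 and 7 can be compared side by side with a short
sketch. The default values of `alpha` are assumptions chosen for illustration (0.01 for
the leaky ReLU as quoted above, 1.0 for the ELU); PReLU has the same form as the leaky
ReLU, with `alpha` learned instead of fixed.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-100.0, -1.0, 0.0, 2.5):
    print(x, relu(x), leaky_relu(x), elu(x))
```

For large negative inputs the ELU approaches the constant -alpha, which is the
saturation mentioned above, while the leaky ReLU keeps decreasing linearly.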


Loss Functions Used in Deep Learning Models 

Loss functions primarily serve to calculate the error of the model. Different loss
functions yield different errors for the same prediction, and the choice has a
considerable effect on the performance of the model. The loss functions commonly used
in the person Re-identification problem fall into two categories, pairwise loss
functions and triplet loss functions.


Pairwise Loss Functions 

To describe the pairwise models, consider $X = \{x_1, x_2, \ldots, x_n\}$ and
$Y = \{y_1, y_2, \ldots, y_n\}$ as a set of person images and the corresponding label of
each person. To distinguish matching from mismatching pairs, $y_i$ and $y_j$ are
compared, and the input images are accordingly labeled as matched or mismatched [28], as
defined in Eq. 8.

$$\text{images}(x_i, x_j) = \begin{cases} \text{matching}, & \text{if } y_i = y_j \\ \text{mismatching}, & \text{if } y_i \ne y_j \end{cases} \qquad (8)$$


The hinge loss function is mostly used for maximum-margin classification. The function
outputs zero when the similarity of the matching pairs is greater than that of the
mismatching pairs by at least the margin ‘m’. The hinge loss function is defined in
Eq. 9.


1 


Y = ———_ 9 
Leaee ”) 


The cosine similarity loss function maximizes the cosine value for matching pairs and
minimizes the cosine value for mismatching pairs, which contribute to the loss only when
their cosine value exceeds the margin m. The loss function is defined in Eq. (10).

$$\text{image}(x_1, x_2, y) = \begin{cases} \max\left(0, \cos(x_1, x_2) - m\right), & \text{if } y = 1 \\ 1 - \cos(x_1, x_2), & \text{if } y = -1 \end{cases} \qquad (10)$$


The contrastive loss function [29] learns a mapping to a low-dimensional space in which
similar input vectors are mapped to nearby points and dissimilar input vectors to
distant points. The loss is computed as given in Eq. (11).

$$\text{image}(x_1, x_2, y) = (1 - y)\,\tfrac{1}{2}\,(\text{Dist})^{2} + y\,\tfrac{1}{2}\,\{\max(0, m - \text{Dist})\}^{2} \qquad (11)$$

In Eq. (11), m is a margin parameter that is greater than zero and acts as a boundary.
The distance between two feature vectors is computed as
$\text{Dist} = D(x_1, x_2) = \|x_1 - x_2\|_2$. The average of the total loss for each of
the pairwise loss functions given above is computed as per Eq. (12).


$$\text{loss}(X_1, X_2, Y) = \frac{1}{n} \sum_{i=1}^{n} \text{image}\left(x_1^{i}, x_2^{i}, y^{i}\right) \qquad (12)$$
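
A minimal sketch of the contrastive loss and of the averaging over pairs is given below.
It assumes the squared form of the loss with a Euclidean distance and the label
convention y = 0 for a matching pair and y = 1 for a mismatching pair; the function
names and the default margin are hypothetical.

```python
import math


def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def contrastive_loss(x1, x2, y, margin=1.0):
    """y = 0: pull the pair together; y = 1: push it apart up to the margin."""
    dist = euclidean(x1, x2)
    pull = (1 - y) * 0.5 * dist ** 2
    push = y * 0.5 * max(0.0, margin - dist) ** 2
    return pull + push


def mean_pairwise_loss(pairs, margin=1.0):
    """Average the loss over a list of (x1, x2, y) training pairs."""
    return sum(contrastive_loss(a, b, y, margin) for a, b, y in pairs) / len(pairs)
```

A mismatching pair that is already further apart than the margin contributes nothing to
the loss, so training concentrates on the pairs that still violate the boundary.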


Triplet Loss Functions 

The triplet model considers a set of triplet images. Let $\text{image}_i$,
$\text{image}_i^{+}$, $\text{image}_i^{-}$ be a triplet of images, where
$\text{image}_i$ and $\text{image}_i^{+}$ are images of the same person, and
$\text{image}_i$ and $\text{image}_i^{-}$ are images of different persons. The loss
function for such models basically creates a margin between the distances of the
matching and mismatching pairs, so that the distance between matched pairs becomes
smaller than the distance between mismatched pairs. A few triplet loss functions are
given in this section.

Euclidean distance is a commonly used distance metric in pattern recognition models. The
L2 distance metric is employed in some triplet loss functions, denoted as
$\text{dist}(W, O_i)$, where $W = \{W_j\}$ denotes the neural network parameters and
$F_W(\text{image})$ denotes the network output for an image. Equation (13) measures the
difference between the distances of the similar and dissimilar pairs of a single triplet
unit $O_i$.

$$\text{dist}(W, O_i) = \left\|F_W(\text{image}_i) - F_W(\text{image}_i^{+})\right\|^{2} - \left\|F_W(\text{image}_i) - F_W(\text{image}_i^{-})\right\|^{2} \qquad (13)$$


The hinge loss function aims to reduce the squared hinge loss of the linear SVM, which
is the same as determining the maximum margin between the true person match and the
false person match during training. This hinge loss function is a convex approximation
of the 0-1 ranking error loss, which measures the model's violation of the ranking order
specified in the triplet. The loss given in Eq. (14) has the margin parameter g, a
regularization parameter that regulates the margin between the distances of the two
image pairs $(\text{image}_i, \text{image}_i^{+})$ and
$(\text{image}_i, \text{image}_i^{-})$. Dist is based on the Euclidean distance.

$$\text{loss}(\text{image}_i, \text{image}_i^{+}, \text{image}_i^{-}) = \max\left(0,\; g + \text{Dist}(\text{image}_i, \text{image}_i^{+}) - \text{Dist}(\text{image}_i, \text{image}_i^{-})\right) \qquad (14)$$


Equation (15) is an improved triplet loss function, where N denotes the number of
triplet training examples and β is a weight factor that balances the inter-class and
intra-class constraints. The function d(·,·) defines the L2-norm distance.


1 
loss (image;, image; , image, ,w) = _ S “(max {dist” (image;, image; image, ,w), 51}, 
+ B max {dist? (image;, image; , image, ), d_2} 


(15) 


Cross entropy loss or Softmax loss: This loss function was proposed by McLaughlin
et al. [30], and the loss equation is defined in Eq. (16).

$$\text{image}(v) = P(y = c \mid v) = \frac{\exp(W_c v)}{\sum_{k=1}^{n} \exp(W_k v)} \qquad (16)$$

In Eq. (16), v is the sequence feature vector, n is the number of identities, y is the
identity of the person, and $W_c$ and $W_k$ denote the cth and kth columns of the softmax
weight matrix W.
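
Equation (16) can be evaluated with a few lines of code. The sketch below computes the
identity probabilities for a feature vector and, as an assumption not spelled out in the
text, takes the negative logarithm of the probability of the true identity as the
training loss; the function names are hypothetical.

```python
import math


def identity_probabilities(v, weight_columns):
    """Softmax over the scores W_k . v, one score per identity."""
    scores = [sum(w * a for w, a in zip(column, v)) for column in weight_columns]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(v, weight_columns, identity):
    """Negative log-probability of the true identity (cross entropy, assumed form)."""
    return -math.log(identity_probabilities(v, weight_columns)[identity])
```

With untrained (all-zero) weights every identity receives the same probability 1/n, so
the loss starts at the logarithm of the number of identities.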

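The Siamese cost proposed by Chung et al. [31] for both the SpatialNet and TemporalNet
architectures is defined in Eq. (17).

$$\text{Dist}(f_i, f_j) = \begin{cases} \frac{1}{2}\left\|f_i - f_j\right\|^{2}, & i = j \\ \frac{1}{2}\left[\max\left(m - \left\|f_i - f_j\right\|, 0\right)\right]^{2}, & i \ne j \end{cases} \qquad (17)$$

In Eq. (17), m denotes the Siamese margin, and $f_i$, $f_j$ are the temporally pooled
feature vectors for persons i and j, respectively.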
Binomial deviance loss function: Wu et al. [32] proposed in 2018 to use cosine
similarity and the binomial deviance loss function for training the neural network
model. The loss function used is given in Eq. (18).

$$\text{loss} = \sum_{i,j} W \odot \ln\left(\exp\left(-\alpha (S - \beta) \odot M\right) + 1\right) \qquad (18)$$

In Eq. (18), ⊙ is the element-wise multiplication operator, i and j index the training
samples, and $S = [S_{i,j}]_{n \times n}$ represents the similarity matrix for the image
pairs, with n the total number of training images and
$S_{i,j} = \text{cosine}(x_i, x_j)$. α and β are the hyperparameters. The matrix M
encodes the training supervision and is defined as

$$M_{i,j} = \begin{cases} 1, & \text{matching pair} \\ -1, & \text{mismatching pair} \end{cases}$$

W represents a weight matrix and is defined as

$$W_{i,j} = \begin{cases} \frac{1}{n_1}, & \text{matching pair} \\ \frac{1}{n_2}, & \text{mismatching pair} \end{cases}$$

$n_1$ and $n_2$ are the numbers of matching and mismatching pairs.
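
The binomial deviance loss of Eq. (18) can be sketched for a small similarity matrix.
The code below builds M and W from a list of identity labels and sums over all index
pairs (i, j); the default values of `alpha` and `beta` are illustrative assumptions, not
the settings used by Wu et al.

```python
import math


def binomial_deviance_loss(similarity, labels, alpha=2.0, beta=0.5):
    """Sum of W * ln(exp(-alpha * (S - beta) * M) + 1) over all pairs (i, j)."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    n_match = sum(1 for i, j in pairs if labels[i] == labels[j])
    n_mismatch = len(pairs) - n_match
    loss = 0.0
    for i, j in pairs:
        matching = labels[i] == labels[j]
        m = 1.0 if matching else -1.0                           # supervision matrix M
        w = 1.0 / n_match if matching else 1.0 / n_mismatch     # weight matrix W
        loss += w * math.log(math.exp(-alpha * (similarity[i][j] - beta) * m) + 1.0)
    return loss
```

The weights 1/n1 and 1/n2 balance the two kinds of pairs, so the usually much larger
number of mismatching pairs does not dominate the loss.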


3 Person Re-identification Datasets 


The person re-identification datasets based on image and video that are used in 
literature are given in Table 1. 


4 Deep Learning Architectures for Person Re-identification 


Different deep learning models used for person Re-Identification are given in Table 2. 
The details provided are in terms of the architectural style used, activation functions, 
loss functions and the corresponding re-identification datasets on which the archi- 
tecture was employed. 


Evaluation Metrics 

The evaluation of person re-identification models is generally performed using 
the Cumulative Matching Characteristic (CMC) curve, the Synthetic Reacquisition Rate, 
and the normalized Area under Curve (nAUC). CMC curves are used to evaluate the 
person re-identification task as a ranking problem [102]. The curve is 
based on the probability of identifying the correct match within the first k ranks; this 
measure can also be termed recall at k. The curve plots the probability that the correct 
match is ranked at or below a particular threshold against that threshold, which ranges 
up to the size of the gallery set. The two aspects of the CMC curve are the first-rank 
re-identification rate and the steepness of the curve. The steeper the curve, the better 
the performance. Secondly, the Synthetic Reacquisition Rate (SRR) curve is derived from 
the CMC curve and measures the probability that any of the k best matches is correct. The 
normalized area under the CMC curve (nAUC) summarizes the overall performance, 
reflecting how reliably the model ranks a positive match above a negative match. The 
higher the value of nAUC, the better the performance. The main objective of all 
re-identification models is to improve Rank-1 recognition rates. 
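
The ranking view of the CMC curve can be made concrete with a short sketch. The code below is a simplified illustration, not the evaluation protocol of any specific benchmark: the function name `cmc_curve` and its arguments are hypothetical, it assumes a precomputed query-by-gallery distance matrix, and it ignores dataset-specific rules such as excluding same-camera or junk images.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=5):
    """Return CMC values for ranks 1..max_rank from a query-by-gallery distance matrix."""
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Gallery indices sorted from nearest to farthest for every query.
    order = np.argsort(dist, axis=1)
    # hits[q, r] is True when the r-th ranked gallery item has the query's identity.
    hits = gallery_ids[order] == query_ids[:, None]
    first_hit = hits.argmax(axis=1)   # 0-based rank of the first correct match
    has_match = hits.any(axis=1)      # queries with no correct gallery item never count
    ranks = np.arange(max_rank)
    # A query counts at rank k if its first correct match is at position <= k.
    matched = has_match[:, None] & (first_hit[:, None] <= ranks[None, :])
    return matched.mean(axis=0)
```

The first element of the returned array is the Rank-1 rate. Under the assumption that nAUC is taken as the mean of the CMC values over all ranks up to the gallery size, it could be approximated as `cmc_curve(dist, q, g, max_rank=len(g)).mean()`.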


5 Experimental Setup 


The Market-1501 dataset consists of 32,668 annotated bounding boxes and 1501 
identities captured by 6 cameras, 5 of which are high resolution and 1 of which is low 
resolution. Each identity or person appears in at least 2 cameras. It is among the largest 
openly available re-identification datasets. 

The dataset employs the Deformable Part Model (DPM) to detect pedestrians in 
the images. For each detected bounding box, a hand-drawn ground-truth bounding 
box is created and the intersection over union (IoU) is calculated. If the IoU value is 
greater than 50%, the detected bounding box is marked as good; if it is smaller than 20%, 
it is marked as a distractor; otherwise it is marked as junk. 
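
A minimal sketch of this labelling step is given below. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and assumes the rule that boxes with IoU above 50% are good, boxes with IoU below 20% are distractors, and everything in between is junk. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_detection(detected, ground_truth):
    """Label a detected box as 'good', 'distractor' or 'junk' from its IoU."""
    score = iou(detected, ground_truth)
    if score > 0.5:
        return "good"
    if score < 0.2:
        return "distractor"
    return "junk"
```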

The setup contains a base, pre-trained CNN as an embedding network to produce 
vector embeddings of the images in n-dimensional space. In our experiments, ResNet 
and Xception networks, pretrained on the ImageNet dataset, are used as the CNN to 
extract these feature vectors (Fig. 12). 

The current model focuses on embedding the images into an n-dimensional vector 
space, a process essential to achieving re-identification. Once this embedding network 
is trained, it can be fed validation images, which are mapped into the vector space 
such that the vector representations of the same person form clusters. These clusters 
can then be extracted using clustering algorithms such as K-Means 
in order to achieve a complete end-to-end re-identification system. This experimental 
setup considers only the first half of the re-identification process: the embedding of 
the images into the vector space. 

Fig. 12 Architecture used for vector mapping of images 

The first experimental setup consists of a residual network with 50 layers, used 
as the embedding network. The images are fed to the ResNet to obtain embeddings 
of the images. The obtained results are then passed to a global average pooling layer 
to reduce them to a one-dimensional vector. Then, online hard mining is carried out 
to mine the hardest triplet in each batch. These triplet vectors are used to calculate 
the triplet loss as 


$$L = \max\{\, d(a, p) - d(a, n) + \text{margin},\ 0 \,\} \qquad (19)$$

where $d(x, y)$ is the distance between the embeddings of x and y, a is the anchor 
image, p is the positive image, n is the negative image, and margin is a tunable 
hyperparameter. The network has a total of 23,587,712 parameters, of which 
23,534,592 are trainable and 53,120 are non-trainable. 
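
The following NumPy sketch illustrates the triplet loss of Eq. (19) combined with hard mining. It is an assumption-laden illustration rather than the exact procedure used in the experiments: it uses the common "batch-hard" variant (for each anchor, the farthest positive and the nearest negative in the batch), Euclidean distance for d, and an arbitrary default margin. The function name `batch_hard_triplet_loss` is hypothetical.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    """Mean over anchors of max(d(a, p) - d(a, n) + margin, 0) with batch-hard mining."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all embeddings in the batch.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self   # same identity, different sample
    neg_mask = ~same             # different identity
    # Hardest positive: farthest sample with the same identity.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: nearest sample with a different identity.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    # Only anchors that have at least one positive and one negative form a triplet.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(hardest_pos[valid] - hardest_neg[valid] + margin, 0.0)
    return float(losses.mean()) if losses.size else 0.0
```

The loss is zero once every anchor is closer to all of its positives than to any negative by at least the margin.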

The second experiment used Xception networks with modified depthwise separable 
convolutions as the embedding network. The architecture has 36 convolutional 
layers, which form the base for feature extraction. The embeddings produced are 
passed on to a global average pooling layer, and the resulting vectors are mined 
for hard triplets, which are then used to calculate the triplet loss. The network has a 
total of 20,861,480 parameters, of which 20,806,952 are trainable and 54,528 are 
non-trainable. 

ResNet50 and Xception are among the high-performing networks 
on the ImageNet dataset and hence are used as feature extractors to embed the image 
dataset into an n-dimensional space. We use pretrained models, with ImageNet 
weights, for the embedding network. 

Adadelta with the learning rate set to 1.0 is used as the optimizer, with parameters 
such as rho (the decay factor) and the decay set to 0. A network trained using this 
method can be used to produce image vectors that can then be passed through a 
clustering algorithm to achieve re-identification. 

Fig. 13 Loss graph for architecture with ResNet50 for embedding network 

The experiment was carried out on a Linux system with 16 GB RAM, a Core i7-8700K 
processor, and an NVIDIA Titan Xp graphics card with 12 GB of VRAM. 

The obtained results are summarized as graphs with the number of epochs on the 
x-axis and the training and validation loss on the y-axis. 

In the first experiment, the model trained on features extracted from ResNet50 
was unable to converge to a satisfactory degree after running for 300 
epochs. The minimum validation loss, about 147.2, was obtained in the initial phases 
of training. The loss then diverged, even with lower learning rates and with 
other optimizers (Fig. 13). 

Subsequently, in the second experiment, the model trained on features extracted using 
the Xception network in the architecture described above converged in 
100 epochs to a loss of about 80.6 without overfitting the training data (Fig. 14). 

Comparing the two, we see that a network trained with an Xception network as 
the embedding network performs better than a network that uses ResNet50 for the 
embedding network. 


Advantages of Deep Learning Models Towards Person Re-identification 


1. Deep learning models attempt to learn high-level features in an incremental 
manner. 

2. Automatic feature learning eliminates the need for a domain expert and the need 
for hand-crafted features in person re-identification. 

3. During both training and testing, deep learning algorithms generally work 
faster. 

Fig. 14 Loss graph for architecture with Xception net for embedding network 


Limitations of Deep Learning Techniques Towards Person Re-identification 


1. The small number of training samples is a major issue in the person 
re-identification problem. Large datasets are needed to develop robust models that can 
handle pose and viewpoint variations in the images. 

2. Training deep learning models within a reasonable amount of time requires 
high-end infrastructure. 

3. Processing time is neglected in most existing work. Hence, compact 
architectures with good recognition rates are required. 

4. More research is needed to reach 100% recognition rates 
and to deal with anomalies. 


6 Conclusions and Future Work 


Person re-identification is considered a challenging task in CCTV surveillance 
systems. Although machine learning techniques have been recognized as good 
performers for person re-identification, deep learning techniques play a vital role in 
this problem, as they reduce much of the human intervention and deep convolutional 
neural networks achieve high recognition rates. Despite its importance, many issues 
remain in implementing such a system in real-world scenarios. A key challenge 
is to reduce the architecture size, in terms of the number of parameters and layers, without 
affecting the recognition rate. Future work should focus on scalable, 
end-to-end re-identification systems that can work in real-time scenarios. 


Abstract Human motion is an important spatio-temporal pattern as it can be a 
powerful indicator of human well-being and identity. In particular, human gait offers 
a unique motion pattern of an individual. Gait analysis is the study of locomotion in both 
humans and animals. Gait involves the coordination of several parts of the human body: 
the brain, the spinal cord, the nerves, muscles, bones, and joints. Gait analysis has 
been studied for a variety of applications including healthcare, biometrics, sports, and 
many others. Until recently, the analysis has been done mainly by human observation, 
using parameters and features established in existing practice and therefore limited by 
the nature of the measurements captured by the gait sensing modalities. In this chapter, 
we review key conceptual and algorithmic facets of deep learning applied to gait 
analysis in two important contexts: security and healthcare. 


1 Introduction 


This book chapter focused on understanding human motion behavior by modern 
machine learning techniques to solve complex, but interesting problems in the areas 
of security and healthcare. Specifically, the interest was to study human gait and 
footsteps, a periodic motion in time and space, and how this unique behavioral 
pattern can provide insights into applications such as detecting perturbation of gait 
under a cognitive task for wellbeing and biometric verification of individuals for 
security. 

Gait, which enables human motion, involves the coordinated interaction of many parts 
of the human body [1]. Consequently, gait is unique for every human being and 
is influenced by independent factors such as height, gender, fitness, and age. Gait 
can be expressed as a spatio-temporal pattern and offers value for a wide range 
of applications, ranging from biometric systems to the identification of markers 
of neurodegenerative diseases, such as Alzheimer’s [2], based on long-term gait 
monitoring. Typically, both temporal and spatial parameters of gait, such as cadence, 
stride length, and others are analyzed from data acquired by a suitable floor sensor 
system or fused from several systems. 

The first research theme of this book chapter focused on gait analysis for healthcare 
in the context of dual tasks. Gait has been shown to be affected by a managed cognitive 
load [3-6]. Walking while simultaneously performing a cognitive task, also known 
as ’dual-task’ in literature, has been shown to induce walking variability in adults 
[3-5]. Moreover, in older adults, walking whilst performing a secondary task such 
as talking has been shown to increase the risk of falling [7]. Dual-task research aims 
to understand the relationship between cognitive activities and gait. 

The negative impact of dual-tasks in participants with neurodegenerative 
disease may be greater than in cognitively healthy adults [6]. This finding suggests that 
the cognitive capacity that each participant brings to the walking task may play an 
important role in the walking patterns. This finding has implications for finding a 
possible gait-related behavioral marker or ’biosignature’ indicating the early stages 
of neurodegenerative diseases. 

The second research theme of this book chapter focused on a biometric 
verification problem applied to security [8]. Biometrics is an area that deals with the 
design of security systems for automatic identification or verification [9] of a human 
subject (client) based on physical and behavioral characteristics. Physical biometric 
traits include fingerprints, facial features, and the iris. Behavioral biometrics, such as gait 
recognition, are intended to capture unique signatures delivered by a client's natural 
behavioral patterns. This approach is useful since such patterns are difficult for an 
impostor (intruder) to reproduce. Biometric recognition by gait is based 
on the study of human locomotion to obtain a distinct biometric signature of a client. 
Twenty-four unique factors have been shown to affect human gait [10], resulting in 
a singular gait pattern for every individual. 

Moreover, a biometric system based on gait requires minimum 
effort from users for appraisal. A gait biometric system can be deployed in scenarios, 


Deep Learning in Gait Analysis for Security and Healthcare 301 


ranging from airport entry checkpoints and entry to buildings to home-based security 
systems. Feature engineering has been central in automatic gait recognition research 
[11]. The procedure involves the careful selection and design of complex and time- 
consuming hand-crafted features from footstep data, employing geometric, holistic, 
spectral and wavelet feature engineering approaches to name some [12]. 

For both themes, the research effort of this book chapter focused on designing 
machine learning models based on Convolutional Neural Networks (CNN), a form 
of deep learning [13], to allow the automatic extraction of features from the raw 
spatio-temporal gait and footstep data. 

The ImageNet Large Scale Visual Recognition Challenge [14] is one of the largest 
computer vision competitions in the world. The challenge objective is to classify 
images into one of 1000 possible labels such as "car", "plane", etc. The dataset 
contains around 1 million images. The modern breakthrough of deep learning 
came from using this massive dataset for image classification with convolutional 
neural networks. The most accurate entries in the challenge in recent years (from 
2014 onwards) have used convolutional neural network techniques at their core [14]. 


2 Gait Analysis Review 


Gait analysis has been widely studied for a variety of applications including 
healthcare, biometrics, sports, and many more [1]. Classification of a person's 
emotional state from gait has also been explored: pride, happiness, neutral emotion, 
fear, and anger have been classified with high statistical confidence given only the 
person's gait pattern [15]. Generally, three types of gait monitoring systems exist, 
namely cameras using image processing, floor sensors, and wearable sensors [11]. The use 
of cameras for gait is vulnerable to details in the environment such as lighting levels. 
Besides that, the use of cameras is considered an invasion of privacy in living 
environments, e.g. for healthcare [16], because of disadvantageous parallels to video 
surveillance. The disadvantage of wearable sensors is that they need to be attached to the 
body, may be uncomfortable to wear, and may require assistance to attach correctly. 
On the other hand, floor sensor systems have the advantage of being non-invasive and 
even unobtrusive, less prone to environmental noise, and undemanding of the subject's 
attention, which affects the data quality positively. 

Gait occurs due to the cooperation of several parts of the human body including the 
brain, spinal cord, nerves, muscles, bones, and joints [1]. Within a walking sequence, 
gait can be understood as a translation of human brain activity into patterns of muscle 
contractions. The command generated by the brain is transmitted through the spinal 
cord to activate the neural centers, which eventually results in 
patterns of muscle contractions supported by feedback from muscles, joints, and 
receptors. This results in the movement of the trunk and lower limbs in a 
coordinated way while the feet repeatedly touch the ground surface and the 
center of mass of the human body changes. Gait can be defined as repetitive cycles for each 
foot resulting in a sequence of periodic events. Each cycle can be divided into stance 
and swing phases as shown in Fig. 1. 

The classification of a person's emotional state from gait has been explored in 
the literature. Pride, happiness, neutral emotion, fear, and anger have been 
classified with high statistical confidence given only the person's gait pattern [15]. 
Generally, three types of gait monitoring systems exist, namely cameras using image 
processing, floor sensors, and wearable sensors [11]. The use of cameras for gait is 
vulnerable to environmental conditions such as lighting. Besides that, the use of cameras 
invades privacy in living environments, e.g. for healthcare [16], because of 
disadvantageous parallels to video surveillance. The disadvantage of wearable sensors is 
that they need to be attached to the body, may be uncomfortable to wear, and may require 
assistance to attach correctly. On the other hand, floor sensor systems have the 
advantage of being non-invasive and even unobtrusive, less prone to environmental 
noise, and undemanding of the subject's attention, which has a positive effect on data 
quality. 

Within a walking sequence, gait can be understood as a translation of human 
brain activity, projected into the spinal cord, into patterns 
of muscle contractions. The command generated by the brain is 
transmitted through the spinal cord to activate the neural centers, which eventually 
results in patterns of muscle contractions supported by feedback from muscles, 
joints, and receptors. This results in movement of the trunk and lower limbs in a 
coordinated way while the feet repeatedly touch the ground surface and the 
center of mass of the human body changes. Gait can be defined as repetitive cycles for each 
foot resulting in a sequence of periodic events. Each cycle can be divided into the phases 
shown in Fig. 1, defined as follows: 


• Stance phase (approximately 60% of the gait cycle, with the foot in contact with 
the ground). This phase is subdivided into four intervals (A, B, C, D). 
• Swing phase (approximately 40% of the gait cycle, with the foot swinging and not 
in contact with the ground). This phase is subdivided into three intervals (E, F, G). 


Fig. 1 Gait cycle [17] 

Fig. 2 Modalities of gait 


The three main modalities for studying gait are image processing, floor sensors, and 
wearable devices. The modalities are shown in Fig. 2 [18]. Gait patterns can 
be obtained from video streams, from footstep pressure measured by floor sensor 
systems, or from accelerometer signals (temporal signals). 

Gait analysis in the context of this work deals with two main components. One 
component is time: this refers to the temporal gait cycle pattern. The other component 
is space: this refers to the spatial footstep shape characteristics of the gait pattern. 
Here, we introduce a methodology to learn spatio-temporal features directly from 
raw sensor data with deep learning models, that is, without the use of human feature 
engineering. The deep learning models are based on ANN architectures with several 
layers that are able to learn features automatically from raw sensor data. 


2.1 Non-wearable Sensors 


Camera-based sensors. Here images and videos obtained from cameras record 
human gait. Then, image processing techniques are used, such as segmentation and 
others to identify gait from the images. This is the most widely used approach in the 
literature. This approach involves both model-free and model-based analysis for gait 
recognition [12]. 

Table 1 shows the state-of-the-art research approaches for vision systems 
in biometrics applications. The performance is indicated as classification rate (CR). 
Histogram-based systems have been found to be the most successful in this area. 


Floor sensors. The main research themes developed in this work are based on floor 
sensor systems. Two main types of floor sensors are studied in the literature: one 
based on the ground reaction force and the other based on switch sensors. The first 
obtains continuous signals, while the latter delivers only binary pressure signals. 
Table 2 shows the state-of-the-art footstep recognition systems, listing 
the number of signals, the number of users, the sample type per model, and performance 
results as equal error rate (EER). 


Table 1 State-of-the-art vision systems for gait analysis in biometrics 


Feature set Subjects 
Histogram and silhouette [19] 3141 
Histogram and deep silhouette [20] 176 
Histogram and energy image [21] 122 
Chrono-gait image [22] 122 
95% CR 

Gait energy volumes [23] 95% CR 25 


Table 2 State-of-the-art floor sensor systems for gait analysis in biometrics 


Research group | Signals | Users | Samples | Model | Multi-shoe | Norm. | Results (EER) 
Cattin [26] | 470 | 16 | 6 step cycles | Euclidean distance | Yes | No | 9.45 
Stevenson et al. [27] | 88 | 85 step cycles | HMM | Yes | Yes | 20 
Mostayed et al. [28] | 18 | 5 step cycles | Histogram similarity | No | No | 3.3 to 16 
Vera-Rodriguez et al. [29] | 9900 | 5 | 500 step cycles | SVM | Yes | Yes | 2.6 val, 4 eval 
Mason et al. [12] | 399 | 10 | 30 step cycles | LDA | Yes | Yes | 1.52 val, 3.1 eval 
Costilla Reyes et al. [8] | 9900 | 127 | 500 step cycles | ResNet and SVM | Yes | Yes | 0.7 val, 1.70 eval 
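
The equal error rate used to report verification performance is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (clients rejected). The sketch below shows one simple way to estimate it from similarity scores. It is illustrative only: the function name `equal_error_rate` is hypothetical, higher scores are assumed to mean "more likely the client", and the threshold search over observed scores is a coarse approximation.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by scanning thresholds taken from the observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)   # impostors wrongly accepted at threshold t
        frr = np.mean(genuine < t)     # clients wrongly rejected at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

A lower EER indicates better separation between client and impostor scores; an EER of 0 means some threshold separates the two groups perfectly.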

The iMagiMat [24, 25] is an affordable floor sensor that allows spatio-temporal 
sampling of the ground reaction force (GRF) resulting from footsteps. The gait data 
can be recorded, stored, and analyzed over long periods of time. The technology 
is embodied in a 1 m by 2 m prototype [24]. Gait is measured by detecting the light 
attenuation caused by the bending of plastic optical fibers (POFs) while a subject walks 
on the surface. The GRF in the active area of the sensor can then be reconstructed for 
further data analysis [24, 25]. Adequate spatio-temporal sampling is ensured by 
applying tomography principles to the floor sensor design and by acquiring spatial 
frames at a suitable rate, set at 256 Hz. 


2.1.1 Advantages 


• Multiple gait parameters can be obtained from a wide set of modalities 
• There are no power constraints 
• Non-intrusive system 
• There are no external factors affecting the analysis due to a controlled environment 


2.1.2 Disadvantages 


• Area of measurement is confined to a limited space 
• Sensor system is often expensive 
• May allow gathering signals without the users' consent or knowledge 
• Not suitable for outdoor applications 


2.2 Wearable Sensors 


In this approach, sensors are placed in different positions of the human body to 
measure gait. The type of sensors that can be used are force sensors, accelerometers, 
gyroscopes, extensometers, and inclinometers. 


Inertial sensors. These sensors use the earth's gravitational field to obtain 
measurements of a subject's velocity, acceleration, orientation, or gravitational forces 
for gait analysis. Three-axis accelerometers and gyroscopes measuring angular velocity 
are usually used for this type of application. 

Table 3 shows the state-of-the-art approaches for inertial systems. There is 
currently no consensus in the research community on the approach or the location of the 
sensor for optimal analysis. 


Ultrasonic sensors. Sound waves are used as the sensing mechanism. By measuring 
the distance between the ultrasonic sensor and the progression of the gait pattern, 
gait can be measured and consequently studied. 


Electrogoniometer. This sensor, often installed at the hip or knee, provides 
continuous measurements of the current joint angle of a human 
subject. 


Table 3 State-of-the-art inertial systems for gait analysis in biometrics 


Feature set | EER % | Location of sensing 

Inner product of acceleration [30] 68 (740 | Pocket 

Key points of acceleration [31] 5 locations 

Gait cycle acceleration [32] 1600] OO Hip 

Direct matching [33] Ankle 

Dynamic time warping [34] Spine 


Exoskeletons. These are devices that cover the entire human body, usually made 
of solid materials. They are combined with goniometer or potentiometer 
sensors to allow the measurement of human kinematics. 


2.2.1 Advantages 


• Long-term gait analysis is possible 
• The system may be inexpensive 
• Controlled environments are not necessary, therefore natural gait signals may be 
obtained 
• Suitable for outdoor applications 
• Freedom to focus the analysis on different locations of the human body 


2.2.2 Disadvantages 


• Power consumption limitations 
• Due to portability, only a limited set of gait parameters can be studied 
• Susceptible to noise and interference from external environments 


In summary, floor sensor systems offer unique advantages over other sensing 
modalities for analyzing gait. Their main advantages are that they are non-intrusive and 
resilient to noise in the environment. In contrast, some inertial systems force the user 
to wear a device during the experiment, and vision systems are susceptible 
to environmental conditions such as varying levels of light 
or cross-view angles, which make it difficult to acquire data in a form suitable for 
analysis [35]. For these reasons, this book chapter focuses on studying the 
ground reaction force from floor sensor systems for gait analysis in two applications: 
healthcare and security. 


2.3 A Review of Floor Sensor Systems and Datasets for Gait 
Analysis 


Cameras, inertial sensors, and floor sensor systems have been used for gait analysis [11, 
36]. Floor sensor systems have the advantage of being unobtrusive and resistant to 
surrounding noise; in contrast, camera systems require adequate illumination, while 
wearable inertial sensors require daily placement and maintenance. A floor sensor 
system can be hidden in a home environment, allowing the acquisition of natural gait 
signals over long periods of time. While floor sensor systems have been built for 
automatic gait analysis applications [11], they have relied heavily on physiologically 
defined, man-made features such as the body's center of pressure, stride length, 
and cadence, rather than using raw sensor signals, to construct gait classifier models. 

An example of a gait recognition system using a switch sensor system is the 
UbiFloorII system [37]. The switches in the UbiFloorII system are made from photo 
interrupter sensors. Each switch sensor generates 0 V or 5 V (on-off) according to 
the weight exerted on the floor sensor system. 

Force plates [38] have been used in gait analysis to obtain 
the ground reaction force perpendicular to the floor sensor system. Piezoelectric 
sensors are used as the sensing mechanism. The piezoelectric effect is the 
accumulation of electric charge in certain solid materials in response to mechanical 
stress. In this case, the measured signal is the response to the pressure exerted by the 
weight of the subject walking on the floor. The change in pressure modifies the voltage 
level at the piezoelectric sensor output, which enables the measurement of gait signals. 

For the goal of classification of human postural and gestural movements using 
floor sensor systems, Saripalle et al. [39] applied force platforms to infer the center of 
pressure of individuals. Eleven body movements by volunteers were analyzed with 
an accuracy ranging from 79 to 92% using linear and non-linear supervised machine 
learning models. Feature selection is highlighted as a critical step for obtaining reli- 
able accuracy scores, but this approach is limited by the lack of a single classification 
model suitable for all types of mobility. 

Floor sensor systems have been used to distinguish human movements, as 
presented in [40]. The recognition is achieved by analyzing the Ground Reaction Force 
(GRF) on a weight-sensitive floor. The changes in the GRF arise from activities per- 
formed at the same position, including jumping, sitting and rising. A hidden Markov 
model was used for human movement classification. The classification performance 
was close to 100%. One of the disadvantages of such a study is that the postural 
activities were performed statically at the same position. 


3 Deep Learning for Gait Analysis 


Supervised machine learning is a specific kind of machine learning, which is itself a 
category of artificial intelligence (AI). Algorithms or mathematical models are built and 
trained with a given set of inputs and desired outputs. The models learn the structure 
of the training data and are then tested on unseen data, producing predictions that can be 
understood and utilized by users [41]. Shallow learning depends on handcrafted 
features and a predefined form of relationship between the inputs and the output; 
examples include linear regression, logistic regression, decision trees, Support Vector 
Machines (SVM), random forests, naive Bayes, and k-nearest neighbors. 

Supervised learning is the most widely used technique in industry for 
classification and prediction. For example, Google Inc. uses supervised learning to 
classify email as spam or not spam, or to rank web pages for its search engine using 
the PageRank [42] algorithm; Facebook Inc. uses supervised learning to automatically 
tag people in pictures uploaded to the social site [43]; and Amazon Inc. uses supervised 
learning in a recommender system [44] that suggests products based on user history. These 
examples are just a small subset of the wide applications of supervised learning in 
industry. 

Deep structured learning or hierarchical learning is inspired by the biological 
neural networks structure and function. It is based initially on the concept of multi- 
layer Artificial Neural Network (ANN) with the aim to learn data representations 
automatically; thus, Deep Learning becomes the method of choice where the classi- 
fication features, if known at all, are complex, with no straightforward quantitative 
relation to the raw data. Typically, the term 'deep' refers to the number of layers 
in the variety of possible network structures: Deep Belief Networks (DBN), 
Feed-forward Deep Networks (FDN), Boltzmann Machines (BM), Generative Adversarial 
Networks (GAN), Convolutional Neural Networks (CNN), Recurrent Neural 
Networks (RNN), and Long Short-Term Memory (LSTM), a special kind of RNN. A 
comprehensive presentation of the theory of ANNs and Deep Learning is not within 
the scope of this review and the reader is referred to established sources [13]. Further, 
we focus on models with practical significance for gait applications such as CNN 
and LSTM. 

The CNN model is suitable for processing 1D, 2D or 3D data that has a known 
grid-like topology [13]. The network has the ability to learn a high level of abstrac- 
tion and features from large datasets by applying convolution operation to the input 
data. Commonly, the network consists of convolution layers, pooling layers, and 
normalization layers, with a set of filters and weights shared among these layers. 

The convolutional layers output feature maps extracted automatically from the 
raw input data. The pooling layers are used to reduce the size of the representation and 
to make it approximately invariant to small spatial shifts. The CNN model commonly uses 
two types of pooling layers: max pooling and average pooling. Convolutional layers 
are typically followed by an activation function (e.g. Sigmoid, Tanh, ReLU, Leaky ReLU), 
which is applied to the weighted sum of a neuron's inputs plus a bias and decides whether 
the neuron fires [45]. LSTM networks are well suited to processing time-series data, where 
the order is of importance, such as gait data sequences. In essence, they exploit 
recurrence, using information from previous time steps of the sequence. 
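
For reference, the activation functions named above can be written in a few lines of NumPy. This is a generic sketch, not code from the chapter: the helper `neuron` is a hypothetical name, and the Leaky ReLU slope of 0.01 is a common default chosen here as an assumption.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squash values into the range (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Set negative values to 0 and keep positive values unchanged."""
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative values are scaled by a small slope instead of zeroed."""
    return np.where(z > 0, z, slope * z)

def neuron(x, w, b, activation=relu):
    """Apply an activation to the weighted sum of the inputs plus a bias."""
    return activation(np.dot(w, x) + b)
```

For example, `neuron(x, w, b, activation=sigmoid)` gives the output of a single sigmoid unit for input vector `x`.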

The goal of using ANNs in gait analysis is to develop a model that extracts gait 
features and performs well on unseen real-world gait data. Commonly, 
the model is trained and validated on 70% of the data and 
tested on the remaining 30%. In supervised training, the procedure starts by 
initializing the weights randomly, processing the inputs, and comparing the resulting 
output against the desired output. During training, the weights and biases are adjusted 
in every iteration until the error is minimized, and validation is used to estimate the 
model performance during training. Lastly, the model is tested with unseen data, 
which makes it possible to identify over-training. 
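
A data split of this kind can be sketched as follows. The function name `split_indices` is hypothetical, and the share of the data reserved for validation (10% of all samples here, taken out of the 70% used for training and validation) is an assumption, since the text does not specify it.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.3, val_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test = order[:n_test]                 # held out until training is finished
    val = order[n_test:n_test + n_val]    # used to monitor performance during training
    train = order[n_test + n_val:]        # used to adjust weights and biases
    return train, val, test
```

The three index arrays are disjoint, so no sample used for testing is seen during training or validation.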

A widely used performance summary for ANN gait analysis is the confusion matrix.
It is a table that shows the number of predictions classified correctly and wrongly
for each class. The table consists of true positive, true negative, false positive, and
false negative classification occurrences. One of the advantages of the confusion
matrix display is that it is straightforward to identify which classes are confused
with one another, which may in turn indicate the quality of the data involved.
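
As a minimal sketch of how such a table can be built, the following Python code counts
predictions per (true class, predicted class) pair and derives the one-vs-rest true
positive, false positive, false negative and true negative counts for a single class. The
label names and the short label lists are hypothetical and serve only as an illustration;
they are not data from the studies discussed here.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a nested list where entry [i][j] counts samples whose true class
    is labels[i] and whose predicted class is labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_counts(matrix, k):
    """One-vs-rest (TP, FP, FN, TN) for the class at index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp  # predicted k, true class otherwise
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

# Hypothetical labels for illustration only.
labels = ["normal", "fast", "dual"]
y_true = ["normal", "normal", "fast", "fast", "dual", "dual", "dual"]
y_pred = ["normal", "fast", "fast", "fast", "dual", "normal", "dual"]

cm = confusion_matrix(y_true, y_pred, labels)
for name, row in zip(labels, cm):
    print(name, row)
print(per_class_counts(cm, labels.index("normal")))
```

Rows correspond to the true class and columns to the predicted class, so off-diagonal
entries show directly which pairs of classes are being confused.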
3.1 Convolutional Neural Networks 


3.1.1 Convolutional Layer 


A core building block in CNN models is the convolutional layer, where the
computational heavy lifting of data processing takes place. This layer is based on
convolution, a specialized kind of linear operation [30]. The convolution operation
is performed on two functions to produce a feature map, where the first function
is the input data and the latter is the filter or kernel. In this process, the filter slides
over the input data and a convolution is computed at each position; the resulting
sums form a feature map (see Fig. 3). The output of the convolution layer consists of
the feature maps produced by its different kernels. An activation function is applied
to produce nonlinear feature maps, making training faster and more accurate. The
most widely used activation function in a convolutional layer is the Rectified Linear
Unit (ReLU), which sets all negative values to 0 and leaves positive values unchanged.
A mathematical representation of the convolution operation, given an input I(t) and a
kernel K(a), is given as


$$s(t) = \sum_{a} I(a)\, K(t - a) \tag{1}$$
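
The sum in Eq. 1 can be sketched directly in Python for a one-dimensional input. The
code below is an illustrative sketch only: the function names and the example values of
the input and the kernel are hypothetical, and the sum runs over the finite support of
the two sequences. The result is compared with NumPy's `np.convolve`, and a ReLU is
applied afterwards to obtain a nonlinear feature map.

```python
import numpy as np

def conv1d_full(signal, kernel):
    """Discrete convolution s(t) = sum_a I(a) * K(t - a) over all valid a."""
    n, m = len(signal), len(kernel)
    out = np.zeros(n + m - 1)
    for t in range(n + m - 1):
        for a in range(n):
            if 0 <= t - a < m:          # K is zero outside its support
                out[t] += signal[a] * kernel[t - a]
    return out

def relu(x):
    """Set negative values to 0 and keep positive values unchanged."""
    return np.maximum(x, 0.0)

signal = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical input I
kernel = np.array([0.5, -1.0])             # hypothetical kernel K

s = conv1d_full(signal, kernel)
feature_map = relu(s)
print(s)
print(feature_map)
print(np.allclose(s, np.convolve(signal, kernel)))
```

Note that many deep learning libraries compute a cross-correlation (the kernel is not
flipped) under the name convolution; since the kernel is learned, the distinction does
not usually affect the trained model.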


3.1.2 Pooling Layer 


There are usually two types of pooling layers in convolutional networks: max
pooling and average pooling. The objective of using this layer is to summarize the
convolutional layer output into a smaller, meaningful representation. In a pooling
layer, a filter slides over the convolutional layer output and the maximum or average
value in the filter window becomes an element of the output matrix of the pooling
layer.
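
A minimal sketch of both pooling types is given below, assuming a non-overlapping
window whose stride equals its size. The function name `pool2d` and the example feature
map are hypothetical.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Trim so the map divides evenly, then expose each window on its own axes.
    trimmed = feature_map[:h_out * size, :w_out * size]
    windows = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

# Hypothetical 4 x 4 feature map.
fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 1.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [1.0, 3.0, 7.0, 8.0]])

max_pooled = pool2d(fmap, size=2, mode="max")
avg_pooled = pool2d(fmap, size=2, mode="average")
print(max_pooled)
print(avg_pooled)
```

Each 2 x 2 window of the input is reduced to a single value, so the 4 x 4 map becomes a
2 x 2 map in both cases.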


Fig. 3 Convolutional layer 
3.1.3 Training and Testing 


In gait analysis, the goal of using neural networks is to develop a model that extracts
gait features and performs well on unseen real-world gait data. For appropriate training
and testing, the model is commonly trained and validated on 70-80% of the data and
tested on the remaining 20-30%. In supervised training, the procedure initializes the
weights randomly, processes the inputs and compares the resultant output against
the desired output. During training, the weights and biases are adjusted in every
iteration until the error is minimized, and validation is used to give an estimate of
the model performance during training. Lastly, the model is tested with unseen data,
which allows over-fitting to be identified.
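
The data partition described above can be sketched as follows. This is an illustrative
example under stated assumptions: the 60/10/30 division of the training, validation and
test parts is one possible way of realizing "70% for training and validation, 30% for
testing", and the function name, the seed and the sample count are hypothetical.

```python
import random

def split_indices(n_samples, train_pct=60, val_pct=10, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts.
    The remaining percentage (here 30%) is used for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation part is consulted during training to monitor performance, whereas the
test part is kept aside and used only once training has finished.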


3.1.4 Convolutional Neural Networks for Spatio-Temporal Analysis 


Recognizing human actions from videos is an important spatio-temporal problem [46].
In recent years, the top-performing models to solve this problem have been based on
CNNs and Recurrent Neural Networks [46-48].

The architectures use publicly available video datasets. These approaches are
effective at learning representations from raw video frames. However, they are complex
and require large computational resources to train; furthermore, in some cases, other
pre-processing steps are required, such as calculating optical flow between video
frames [46]. Convolutional neural networks have been proposed to study
spatio-temporal human recognition [49], including 3D convolution operations to capture
both the spatial and temporal domain components [50]. Action scene understanding
has also been proposed [51]. The two-stream convolutional network has been shown to
be effective for spatio-temporal action recognition problems [46, 49].

The two-stream deep learning architecture [46, 49] utilizes an end-to-end learning 
approach for analyzing the spatial and temporal streams of videos in two separate 
deep networks. The spatio-temporal information is combined at a feature or score 
level after the last layers in the network. However, this approach sometimes involves 
computationally heavy calculations such as optical flow [46]. 

Gait analysis from video (spatio-temporal features) has been widely studied
in the literature [12, 35, 52]. Wu et al. [35] presented a study of cross-view gait
for human identification, using deep convolutional neural network models on three
gait datasets. The results show a substantial increase in the average recognition rate
when compared with the previous state of the art. For example, on the
CASIA-B dataset, the average recognition rate reaches 94.1% with a deep network.
This compares favorably with the previous best recognition rate of 65% obtained with
hand-crafted feature engineering.
4 Deep Learning in Healthcare: A Case Study in Dual-Tasks 


We investigated the effect of cognitively demanding tasks on patterns of human gait in
healthy adults with a deep learning methodology that learns from raw gait data.
Age-related differences were analyzed in dual-tasks in a cohort of 69 cognitively healthy
adults organized in stratified groups. A novel spatio-temporal deep learning methodology
was introduced to effectively classify dual-tasks from spatio-temporal raw gait data,
obtained from a tomography floor sensor. The approach outperformed traditional
machine learning approaches. The most favorable classification F-score obtained was
97.3% in dual-tasks in a young age group experiment. The deep machine learning
methodology outperformed classical machine learning methodologies by 63.5% in
the most favorable case. Finally, a 2D manifold representation was obtained from
the trained deep learning models, to visualize and identify clusters from features
learned by the models. Here, we demonstrate a novel approach to dual-task
research by proposing a data-driven methodology with stratified age groups.


4.1 Aims and General Method 


This study aims to establish a benchmark for the relationship between a managed
cognitive load and gait in cognitively healthy participants from a data-driven analysis
perspective, in contrast to the traditional studies found in the gait literature, which
rely on gait parameters such as gait speed variability, walking base and others
[3-5, 53]. This study proposes a novel analytic approach to gait analysis based on
advanced computational models known as deep machine learning [13].

Current methods in dual-task analysis rely on specific statistical features such as
gait speed and variability [3-5, 53]. These studies rely on a limited number of features
and parameters to make inferences about the gait patterns. Here, we use a data-driven
approach to learn the parameters. Moreover, traditional approaches have sometimes
included only a few experimental gait samples per participant in the analysis.

In this study, the dual-task effects are studied using deep machine learning
principles [13] to automatically define and extract the most favorable features, harvested
from raw spatio-temporal gait data and selected by the model. The data were obtained
from an original tomographic floor sensor system based on sensing enabled by plastic
optical fibers [25], sampled directly from the raw sensor data rather than from
reconstructed data [24], which requires further processing.

A large cohort of 69 participants took part in the study. They performed each
task for 5 min. The approach resulted in a sizeable set of gait samples per participant
experiment that allowed statistical reliability [54]. Furthermore, large datasets are
beneficial for the most favorable application of deep learning models.
4.2 Background 


In dual-tasks, a participant performs a cognitive task whilst walking [4] to measure
the effect of the cognitive task on walking patterns. Studies have shown that
stride velocity changes and gait variability increases during the performance of a
cognitively demanding task [3-5]. During dual-tasks a participant’s executive
function, postural control and the ability to walk are altered. Dual-tasks have caused
pronounced changes in the gait of participants with mild cognitive impairment (MCI)
when compared to healthy control groups [53].

Several factors may influence changes in gait while performing cognitive tasks.
For example, the impact of a cognitive task on walking speed has been linked to
the difficulty of the task and to the nature of the walk. Other factors that might
influence dual-tasks include anxiety, happiness, and other emotional states, which
have received less attention in the gait analysis literature [55]. Moreover, a clear
consensus on the effect of cognitive load on walking patterns has yet to emerge in
the literature [4].


4.3 Methodology 


4.3.1 Inclusion Criteria 


Healthy men and women between the ages of 20 and 65 years were invited to
participate in the study. Those with any condition that might affect a normal walking
pattern, typically a history of falls within 6 months prior to enrolment, were excluded
from the study. Statistical information such as gender and age was also captured
to allow further analyses. All methods were performed in accordance with guidelines
and regulations by the University Research Ethics Committee (UREC) at the
University of Manchester. The experimental protocol was approved by the Ethics
Committee with reference ethics/15536 on January 25, 2016. Informed consent was
obtained from the participants to take part in this study.


4.3.2 Procedure 


Four walking experiments were executed by each participant on the floor sensor
system. The participants initially undertook normal and fast walk experiments,
followed by two dual-task experiments. The first dual-task experiment was to spell five
common words in reverse [3]. In the second, participants performed serial seven
subtractions starting from a random 3-digit number [5]. The experiments were
performed in a silent environment, with external distraction kept to a minimum.
Participants were allowed to wear any type of footwear during the experiments. Each
experiment lasted 5 min for a total of 20 min per participant. The number of
captured experimental gait samples depended on the participant’s speed and manner
of walking, which varied among participants. The participants walked continuously
from one end of the walkway to the other during the experiment. An extra one-meter
length at the start and end of the floor sensor system was allowed to enable the
participants to accelerate and decelerate their walk. No cameras or video recording were
used since they can significantly compromise the privacy of participants [16] and
adversely affect the quality of the data.


4.3.3 Description of Experiments 


1. Normal walk experimental task: Participants walked at normal self-selected speed 
for the duration of the experiment. 

2. Fast walk experimental task: Participants walked at self-selected fast walking 
speed. 

3. Dual-task one, reverse spelling experimental task: Participants were given a set of 
five-letter common words and they had to spell the word continuously backward 
out loud [3] for the duration of the experiment. For example, spell “earth” or 
“could”. 

4. Dual-task two, backward serial subtraction experimental task: Participants sub- 
tracted seven from a random three-digit number continuously out loud during the 
five minutes of the experiment [5]. 


4.3.4 Database: UoM-Gait-69 


The dual-task database collected, entitled UoM-Gait-69, comprises data from
69 cognitively and physically healthy adults who participated in our study. The
participants’ ages ranged from 20 to 63 years. Thirty-seven (53%) were female. The
participants were given a unique identification number (ID) for anonymization and
experiment identification.


4.4 Experiments for Age-Related Classification 


We designed a set of seven experimental cases (rows of Table 4). In each case, the
database volunteers are arranged in different groups. For example, experimental
case 1 has 3 groups: group 1 of 27 participants between 20-28 years, group 2 of
22 participants between 31-42 years and group 3 of 20 participants between
46-63 years.

The experimental cases cover a wide range of age-cohort sets (group columns of
Table 4) of age-related differences in dual-tasks executed by the participants, with
the aim of testing the ability of a machine learning model to differentiate age-range
sets with a variable number of participants per set. Table 4 lists the experiment type,
number of participants and age range. In some experiments, a large age cohort of
participants was contained in each group. For example, experiment one had three
decade-long age sets for classification, with approximately 20 participants in each set.
A single age cohort was included in other experiments, such as experiment
seven, for participants aged 20 and 26 years.


Spatio-Temporal Raw Sensor Matrices 


Spatio-temporal raw sensor matrices (RSMs) as described in [25] were constructed
from the raw sensor data in this study. This approach did not require tomography
reconstructed images; instead, it was possible to derive the RSMs directly from the
raw data. RSMs were therefore calculated for all the experiments performed for
this study.


4.5 Spatio-Temporal Deep Learning Model 


The deep learning inception architecture, shown in Fig. 4, contains two-stream
Inception-like modules that model space and time within the same deep network. One
of the streams was assigned to the temporal domain whilst the other was assigned to
the spatial domain. The spatio-temporal streams were optimized at the same time by
backpropagation [13]. Each stream ends before a fully connected layer of 100 neurons
to allow balanced spatio-temporal feature concatenation. The outputs of the
network were then passed through a last fully connected layer with softmax activation
that allowed gait sample classification. After this layer, classification performance
results were obtained according to the experiment type (see Table 4). ReLU activations
are used in all layers [13].
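
The feature-level fusion described above can be sketched in a few lines of NumPy. This
is a sketch under stated assumptions, not the implementation used in the study: the
Inception-like streams are replaced by random stand-in outputs, each stream is assumed
to feed its own fully connected layer of 100 neurons, and the stream output size (256),
the batch size (4) and the number of classes (3) are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, weights, bias, activation=None):
    """Fully connected layer, optionally followed by a ReLU."""
    z = x @ weights + bias
    return np.maximum(z, 0.0) if activation == "relu" else z

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes; only the 100-neuron layers come from the text.
batch, stream_dim, fc_units, n_classes = 4, 256, 100, 3

spatial_out = rng.normal(size=(batch, stream_dim))   # stand-in for the spatial stream
temporal_out = rng.normal(size=(batch, stream_dim))  # stand-in for the temporal stream

w_s, b_s = rng.normal(scale=0.05, size=(stream_dim, fc_units)), np.zeros(fc_units)
w_t, b_t = rng.normal(scale=0.05, size=(stream_dim, fc_units)), np.zeros(fc_units)
w_o, b_o = rng.normal(scale=0.05, size=(2 * fc_units, n_classes)), np.zeros(n_classes)

spatial_feat = dense(spatial_out, w_s, b_s, "relu")    # 100 features per sample
temporal_feat = dense(temporal_out, w_t, b_t, "relu")  # 100 features per sample

# Feature-level fusion: both streams contribute the same number of features.
fused = np.concatenate([spatial_feat, temporal_feat], axis=1)
scores = softmax(dense(fused, w_o, b_o))
print(fused.shape, scores.shape)
```

Giving both streams the same number of units before concatenation means that neither
the spatial nor the temporal features dominate the fused vector by sheer dimensionality.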


Table 4 Description of 7 experimental tasks 
Fig. 4 Deep learning spatio-temporal architecture (inception inspired). Temporal and
spatial streams are concatenated at the feature level before a last fully connected
layer to obtain classification scores

4.6 Results for Age-Related Differences 


Classification experiments were performed in three age groups organized in 7
experiments. As shown in Table 4, age group one, the youngest age set, ranged from 20 to
28 years. Age group two, the middle age set, ranged from 31 to 42 years (with
the exception of experiment seven, a 26-year-old only group). Age group three, the
older age group, contained participants aged between 46 and 63 years.

The classification results are reported as a triple measure, containing the F-score
percentage performance for normal walk compared to (1) fast walk, (2) dual-task one
and (3) dual-task two.

The highest classification performance was obtained for experiment 7,
whilst the lowest performance was obtained for experiment 1. In the former, two
age groups each containing a single age were classified, while the latter had the largest
number of participants distributed by age group. These characteristics influenced the
classification performance. Figure 5 shows an F-score performance summary of all
the experiments.

Experiment 7 contains 6 participants per single age (ages 20 and 26 years), while
experiment 3 contains a large number of participants in wide age ranges: 27
participants between the ages of 20-28 years and 20 participants between 46-63 years.
The wide age ranges found in experiment 3 pose a greater challenge to the
classification algorithms to discriminate the gait samples correctly, in contrast with
experiment 7, in which each group contains a single age (Table 4).


Model comparison 


Ensemble and linear machine learning models were also implemented to compare
with the results presented in Sect. 4.3. Table 5 demonstrates that the Random Forest
returned an F-score of 56.12% whilst the linear SVM returned the lowest
classification performance overall with an F-score of 23.67%. The deep learning
methodology (F-score: 97%) improved on the F-score of the Random Forest classifier by
40.88 percentage points, while the best improvement was obtained against the linear
SVM classifier by 63.5%. These results support the conclusion that the deep machine
learning methodology classifies more robustly than shallow, ensemble and linear
machine learning models (Table 6).

Fig. 5 F-score performance summary of the seven age-related experiments. Y axis:
F-score in %. X axis: number of experiment

Table 5 Age-related differences classification results

Experiment | Classification classes | Total | Age | Precision (%) | Recall (%) | F-score (%) | Support
 | Normal and | | | 81.21 | 74.53 | 72.34 | 7052
Table 6 Comparison of the deep learning models against shallow models of experiment seven.
Classification of normal and dual-task two is shown. Classes are defined in Table 5

Optimization | Model | Precision (%) | Recall (%) | F-score (%) | Support
Early stop | Two-stream inception | 97 | 96.99 | 96.99 | 1197
Genetic programming | Gradient boosting classifier | 62.79 | 62.82 | 62.50 | 1197
None | Random Forest | 57.28 | 57.31 | 56.12 | 1197
None | Linear SVM classifier | 26.36 | 27.82 | 23.67 | 1197


4.7 Analysis of Experiments Three and Seven 


Here, experiments three and seven, described in Table 4, are further explored, since
the former had a large cohort of participants in two age ranges, whilst the latter
delivered the best F-score over all experiments, for classification of two single-age
groups. Tables 7 and 8 show the detailed performance results per class for experiments
3 and 7 respectively. The analysis included metrics such as the Matthews correlation
coefficient, informedness, markedness, and prevalence [56] to further inform the
classification performance results.
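
These metrics all follow from the four counts of a one-vs-rest confusion matrix. The
sketch below uses their standard definitions; the function name is hypothetical. As
input it takes the counts of one class of Table 7 (P = 1251, N = 3011, TP = 1137,
TN = 2786), with the false negatives and false positives derived here as P - TP and
N - TN.

```python
from math import sqrt

def binary_metrics(tp, fp, fn, tn):
    """Standard one-vs-rest metrics computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity, hit rate, recall
    tnr = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)   # precision
    npv = tn / (tn + fn)   # negative predictive value
    fpr = fp / (fp + tn)   # fall-out
    return {
        "recall": tpr,
        "specificity": tnr,
        "precision": ppv,
        "npv": npv,
        "f_score": 2 * ppv * tpr / (ppv + tpr),
        "mcc": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "informedness": tpr + tnr - 1,
        "markedness": ppv + npv - 1,
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
        "lr_plus": tpr / fpr,                # positive likelihood ratio
        "dor": (tp * tn) / (fp * fn),        # diagnostic odds ratio
    }

# One class of Table 7; FP and FN are derived as N - TN and P - TP.
m = binary_metrics(tp=1137, fp=3011 - 2786, fn=1251 - 1137, tn=2786)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Rounded to the precision used in the table, the recall (0.91), specificity (0.93),
precision (0.83), positive likelihood ratio (12.16) and diagnostic odds ratio (123.5)
computed this way agree with the values printed for that class.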

Figures 6a and 7a show the precision-recall curves [56] for experiments three
and seven respectively, which plot precision against recall over a range of threshold
values. Figures 6b and 7b show the receiver operating characteristic (ROC) curves
[56] for the same two experiments. This curve plots the true positive rate against
the false positive rate of a machine learning model as the decision threshold varies.
As in the case of the precision-recall curve, the experiment seven model outperforms
experiment three in the ROC curve.


4.8 Discussion 


Our results demonstrated age-related differences in gait patterns in 7 experimental
cases from a large cohort of healthy adult participants. The experiments compared
normal walk, fast walk and two managed cognitive load activities. The
spatio-temporal problem was addressed by a deep machine learning methodology which has
the ability to learn end-to-end from raw spatio-temporal gait data. The effectiveness
of the approach was justified by comparing our results to the performance of
optimized shallow, linear and ensemble machine learning models.

Overall, the best performance of the methodology was observed between normal
walk and fast walk, then between normal walk and dual-task two, and finally
between normal walk and dual-task one. Task two imposed a heavier cognitive
load on participants than task one, resulting in a more pronounced gait pattern,
which affected the ability of the machine learning model to classify gait
patterns successfully. Moreover, for participants to perform the arithmetic operations
of dual-task two, coordination among several processes, such as articulatory, phonatory
and respiratory functions, was required, which might have led to a greater demand
on executive function processing [5]. High classification performance was also
observed with short age-range groups and with a large age gap between groups.
These characteristics tended to isolate the gait pattern even further.

Fig. 6 Classification performance characteristics of experiment three

Fig. 7 Classification performance characteristics of experiment seven

Table 7 Performance per class of experiment three

Experiment | Experiment 3: Normal and dual-task two
Classes | | | |
Population | 4262 | 4262 | 4262 | 4262
P: Condition positive | 1251 | | | 899
N: Condition negative | 3011 | | 3429 | 3363
Test outcome positive | 1362 | | | 837
Test outcome negative | 2900 | | 3386 | 3425
TP: True Positive | 1137 | | | 734
TN: True Negative | 2786 | | 3263 | 3260
FP: False Positive | | 91 | 166 | 103
FN: False Negative | | | | 165
TPR: Sensitivity, hit rate, recall | 0.91 | | 0.85 | 0.82
TNR = SPC: Specificity | 0.93 | | 0.95 | 0.97
PPV: Pos Pred Value (Precision) | 0.83 | | 0.81 | 0.88
NPV: Neg Pred Value | | | | 0.95
FPR: Fall-out | 0.07 | | 0.05 | 0.03
FDR: False Discovery Rate | 0.17 | | 0.19 | 0.12
FNR: Miss Rate | | | 0.15 | 0.18
MCC: Matthews correlation coefficient | | | 0.79 | 0.81
Informedness | | | | 0.79
Markedness | | | 0.77 | 0.83
Prevalence | 0.29 | | | 0.21
Positive likelihood ratio | 12.16 | 28.09 | 17.61 | 26.66
Negative likelihood ratio | | | 0.16 | 0.19
Diagnostic odds ratio | 123.5 | 190.33 | 113.47 | 140.8
False omission rate | 0.04 | 0.06 | 0.04 | 0.05

The high classification performance obtained in the age-related experiments
demonstrated that the deep learning methodology presented here may be appropriate
for gait data analysis from participants with MCI in large cohort studies [57].
People with impaired executive function in the context of a diagnosis of AD have
shown an exaggerated dual-task effect in walking patterns compared to cognitively
healthy controls [58]. This finding has yet to be verified using deep machine learning
modeling applied to a large sample of participants who have prodromal AD. This might
enable finding a robust behavioral marker of AD in the very early prodromal stage.

Table 8 Performance per class of experiment seven

Experiment | Experiment 7: Normal and dual-task two
Classes
Population
P: Condition positive
N: Condition negative
Test outcome positive
Test outcome negative
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
TPR: (Sensitivity, hit rate, recall)
TNR = SPC: (Specificity)
PPV: Pos Pred Value (Precision)
NPV: Neg Pred Value
FPR: Fall-out
FDR: False Discovery Rate
FNR: Miss Rate
MCC: Matthews correlation coefficient
Informedness
Markedness
Prevalence
LR+: Positive likelihood ratio
LR-: Negative likelihood ratio
DOR: Diagnostic odds ratio
FOR: False omission rate

0.01  0.95  0.24


4.9 Future Directions 


The methodology presented here has the potential to be expanded to multimodal
sensor fusion in a set-up for ambient sensing of natural human behavior, enabling
holistic analysis of an unobtrusively captured combination of voice patterns, upper
body movement, gaze and other human behavioral patterns of participants. The approach
has the potential to improve our understanding of the relationship between human
behavioral patterns and cognitive decline, to the benefit of the detection of
neurodegenerative diseases in the prodromal stage.


5 Deep Learning in Security: A Case Study in Biometrics 


The advantage of gait as a biometric modality is that it allows detection at a distance
in an unobtrusive manner, is inexpensive, and is difficult for an intruder to forge.
However, gait is not a perfect biometric marker, since it has also raised
privacy concerns. Users of biometric systems have raised concerns regarding the
intrusive nature of the system, due to its ability to acquire signals without the users’
consent or knowledge. The difficulties in considering gait as a biometric include
changes in the client’s clothes, shoes and emotions, among other factors. This is
where machine learning methods for gait analysis have been shown to be effective.

Footstep recognition uses force signatures made by a person’s footsteps over a floor
sensor system. This force is known as Ground Reaction Force (GRF). In contrast
to obtaining gait from video streams, it is non-intrusive and less prone to noisy
environmental conditions that might diminish the performance of the system.

Footstep patterns tend to contain a high degree of variability, thus making visual 
assessment difficult. The discovery of a system that is able to robustly find and isolate 
these patterns automatically is at the forefront of gait biometric research. 

Footstep data was collected from 127 users within an 18-month period. Three
datasets were introduced for the experiments performed. The Benchmark 1 (B1) dataset
considers 40 stride footstep samples for 40 clients for training the machine learning
models. This is the smallest dataset considered (in the number of available training
signals per user), thus representing a security application. The Benchmark 2 (B2)
dataset considers 200 stride footstep samples for 15 clients for training; this
represents a middle-level amount of footstep data available to train a machine learning
model, as might be found, for example, at a supermarket or a workspace. The Benchmark
3 (B3) dataset contains 500 stride footstep signals for 5 clients. This is the largest
dataset considered, for an application where a large number of footstep signals can
be acquired for training, for example in a home-based scenario. In all cases, an
evaluation dataset of 500 signals was considered.


5.1 Aims and General Method 


This study aims to establish a benchmark for the relationship between a managed cog- 
nitive load and gait in cognitively healthy participants from a novel data analysis per- 
spective. We will apply a novel analytic approach based on advanced computational 
models known as deep machine learning [13]. Current methods in dual-task analysis
rely on specific statistical features such as gait speed and variability [3-5, 50]. This
focus has been influenced by human observation, which is intrinsically subjective 
and has limited reach. Usually, only a few experimental gait samples per participant 
have been included in the analysis. 

In contrast, in this study, the dual-task effects are studied using deep machine 
learning principles [13] to automatically define and extract optimal gait features 
harvested from raw spatio-temporal gait data. The data were obtained from an original 
tomographic floor sensor system [51] sampled directly from the raw sensor data 
rather than from reconstructed data [52], which requires further processing. A large 
cohort of 69 participants was recruited, resulting in a sizeable set of gait samples 
per participant experiment that allowed statistical reliability [53]. These aspects are 
features not commonly found in gait analysis research. Furthermore, a large dataset 
is beneficial for the optimal application of deep learning models.


5.2 Footstep Data as a Biometric 


Footstep feature extraction and feature engineering have played a central role in 
automatic footstep recognition research [12]. This procedure involves the careful 
selection and design of very complex and time-consuming hand-crafted features for 
footstep recognition. The features include Geometric, Holistic, Spectral and Wavelet 
approaches to name a few [12]. Automatic feature learning models [13] have not 
been well studied for biometric footstep recognition using floor sensor systems.

Research studying footstep data as a biometric has collected footstep signals from:
(i) switch sensors [59, 60], analyzing the spatial distribution of the footstep
signals, and (ii) pressure sensors [26-28], focusing on dynamic pressure information
in the signals, but with low spatial resolution. Qian et al. [61] use a commercial
high-resolution pressure mat to extract center of pressure information, therefore
using time and spatial pressure information only for some selected key points
(geometric approach).

Recently, footstep signals in temporal and spatial domains were analyzed [29],
reporting experiments on the SFootBD. The spatial information was extracted from
accumulated pressure images. Temporal information was extracted from the average
GRF and from other hand-crafted features. Principal Component Analysis (PCA) was
used for dimensionality reduction of the footstep data and a non-linear SVM was used
for biometric verification. Equal Error Rates (EER) in the range of 2.5-10% were
achieved depending on the application setting. In [36] we reported
a pilot study of a convolutional neural network model to learn processed spatial
footstep features of the SFootBD database, suggesting significant improvements in
footstep recognition performance compared to existing work [29].
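
The Equal Error Rate is the operating point at which the false acceptance rate (FAR)
equals the false rejection rate (FRR). The following sketch approximates it by scanning
every observed score as a candidate threshold. The function names and the genuine and
impostor scores are hypothetical and are not taken from the SFootBD experiments; higher
scores are assumed to mean a better match with the claimed client.

```python
def error_rates(genuine, impostor, threshold):
    """FAR: fraction of impostor scores accepted (score >= threshold).
    FRR: fraction of genuine scores rejected (score < threshold)."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate the EER as the mean of FAR and FRR at the threshold where
    the two rates are closest; returns (eer, threshold)."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Hypothetical similarity scores for illustration only.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

With few scores the two rates may never coincide exactly, which is why the sketch
reports the average of FAR and FRR at the closest point; denser score sets or
interpolation give a finer estimate.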

Table 2 shows the recognition performance of the approach compared to other
known biometric verification systems based on floor sensor data only. The other
studies do not use the SFootBD database, and thus cannot be directly compared to this
work in terms of performance since the experiments differ in the number of clients
and footstep signals. However, we are using a much larger database and therefore
the performance results are more statistically significant.

Fig. 8 Two-stream spatio-temporal resnet architecture for raw footstep representation

In this report, we analyze the effect of evaluating a set of diverse footstep data 
representations in machine learning models. Two representations worked best overall 
for the spatio-temporal biometric verification problem presented here: raw footstep 
data and processed footstep data. 


5.3 Deep Residual Network Model 


The deep machine learning models used in this work are based on the state-of-the-art 
resnet architecture [62]. 

The resnet architecture, illustrated in Fig. 8, consists of spatial and temporal
streams for the raw representation. From input to output, each stream consists of the
following layers: first, a resnet configuration 1 block (2ay) (Fig. 9 right),
followed by two resnet configuration 2 blocks (2by and 2cy) (Fig. 9 left), then an
average pooling layer, a fully connected (FC) layer and finally a softmax layer. The
blocks consist of convolutional layers, batch normalization [63] and ReLU activation
functions [64]. The residual units in the network can be expressed in general form
as:

$$y_l = h(x_l) + G(x_l, W_l), \tag{2}$$


$$x_{l+1} = f(y_l), \tag{3}$$


where $x_l$ is the input to the $l$-th residual block, $x_{l+1}$ is its corresponding
output and $G$ is a non-linear residual function. $h(x_l) = x_l$ is an identity mapping
and $f$ is a ReLU activation function [64]. $W_l = \{W_{l,k} \mid 1 \le k \le K\}$ is
the set of weights and biases of the $l$-th residual block, where $K$ is the number of
layers in a residual unit. If $f$ is an identity mapping, then $x_{l+1} = y_l$,
therefore Eq. 3 can be expressed as:


$$x_{l+1} = x_l + G(x_l, W_l). \tag{4}$$
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For any unit of L and shallow unit /, the forward propagation of the feature x; can 
be expressed as an additive output: 


L=1 


x, =x) + > G(x, W)). (5) 
i=l 


Therefore, during forward propagation, x; 1s propagated to any x, plus the residual 
factor. If the loss function is expressed as yy, the backpropagation of errors in the 
network can be expressed as the chain rule [65]: 


dy ody ax, 0 i 
a Et = GH, WD). (6 
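
As an illustrative sketch only (not the implementation used in this work), the additive
form of Eq. 5 can be checked numerically with a toy residual function. The names
`residual_fn`, `forward` and `additive_form`, the scalar weight pairs and the input vector
are hypothetical placeholders; a real residual block would use convolutional layers,
batch normalization and ReLU as described above.

```python
import math

def residual_fn(x, weights):
    """Toy residual function G(x, W): a single tanh layer (assumption for illustration)."""
    w, b = weights
    return [math.tanh(w * xi + b) for xi in x]

def forward(x, all_weights):
    """Apply x_{l+1} = x_l + G(x_l, W_l) (Eq. 4) block by block.

    Returns the feature of every unit and the residual produced by every block.
    """
    features, residuals = [x], []
    for weights in all_weights:
        g = residual_fn(features[-1], weights)
        residuals.append(g)
        features.append([xi + gi for xi, gi in zip(features[-1], g)])
    return features, residuals

def additive_form(features, residuals, l, L):
    """Right-hand side of Eq. 5: x_l plus the residuals of blocks l, ..., L-1."""
    return [features[l][k] + sum(r[k] for r in residuals[l:L])
            for k in range(len(features[l]))]

all_weights = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.4)]   # hypothetical W_l values
features, residuals = forward([1.0, -2.0], all_weights)
x_L_direct = features[-1]                               # x_L from repeated Eq. 4
x_L_additive = additive_form(features, residuals, 0, len(all_weights))
```

Both `x_L_direct` and `x_L_additive` describe the same feature up to floating-point
rounding, which is the property that lets the gradient in Eq. 6 contain a direct term
from the deeper unit to the shallower one.
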
Using the Resnet as a feature extractor eases the evaluation of the biometric
verification system: the learned feature set is evaluated with a discriminative linear
classifier, which saves computational resources and time. The linear classifier
selected for the evaluation of the experiments was a linear Support Vector Machine
(SVM), due to its high biometric performance when compared with other linear
classifiers such as logistic regression or the perceptron. If u is the total number of
clients for a given experiment, then u linear SVM classifier models must be trained
using the Resnet models as a feature extractor, instead of training u Resnet models,
which are computationally expensive to train.

The RMSprop [66] optimizer was selected to update the model's weights due to
its stability at training time. All models were trained with a batch size of 32 samples.
Initialization of the models with ImageNet Resnet-50 [67] weights for transfer
learning was tested without major improvements; therefore, the weights were instead
initialized by sampling values from a Gaussian random distribution to ease the
initialization process. The RMSprop learning rate was initially set to 0.001 and
decreased by a factor of 10 once the learning error plateaued. An early stopping
procedure was implemented: training stopped once the validation error stopped
decreasing.
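
The learning rate schedule and early stopping described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not the training code
of this work: the function name `replay_schedule`, the `patience` of 2 epochs and the
tolerance `tol` are hypothetical, since the text does not state how a plateau was
detected. The sketch replays given per-epoch errors instead of training a model.

```python
def replay_schedule(train_errors, val_errors, lr=0.001, patience=2, tol=1e-4):
    """Replay per-epoch errors and return (epochs_run, learning rate used per epoch).

    The learning rate is divided by 10 when the training error has not improved
    for `patience` epochs; training stops when the validation error has not
    improved for `patience` epochs. Both rules are assumptions for illustration.
    """
    best_train = best_val = float("inf")
    train_wait = val_wait = 0
    lr_history = []
    for epoch, (tr, va) in enumerate(zip(train_errors, val_errors), start=1):
        lr_history.append(lr)
        if tr < best_train - tol:          # training error still improving
            best_train, train_wait = tr, 0
        else:                              # plateau: count, then reduce the rate
            train_wait += 1
            if train_wait >= patience:
                lr, train_wait = lr / 10, 0
        if va < best_val - tol:            # validation error still improving
            best_val, val_wait = va, 0
        else:                              # early stopping check
            val_wait += 1
            if val_wait >= patience:
                return epoch, lr_history
    return len(lr_history), lr_history

# Hypothetical error curves, one value per epoch.
train_errors = [0.9, 0.5, 0.5, 0.5, 0.3, 0.3]
val_errors = [0.8, 0.6, 0.55, 0.56, 0.57, 0.58]
epochs_run, lr_history = replay_schedule(train_errors, val_errors)
```

With these hypothetical curves the rate drops from 0.001 to 0.0001 after the training
error stalls, and training stops at epoch 5 once the validation error has failed to
improve twice in a row.
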


5.4 Spatial and Temporal Architectures 


As footstep GRF patterns tend to contain a large degree of fine-grained GRF
variability, they are difficult to visualise for evaluation by humans. Figs. 10 and 11
show a side-by-side comparison of stride raw (top) and processed (bottom) spatial
footstep representations from 2 clients of the SFootBD, considering 2 samples per
user. The comparison implies that effective footstep recognition based only on visual
perception is a very challenging problem, as there can be high intra-user variability
and low inter-user variability in some cases. Moreover, humans are not accustomed to
recognizing this type of image, as opposed to other biometric traits such as faces.
Machine learning has been used in an attempt to differentiate the fine-grained GRF
variability between clients and impostors.

The spatial and temporal footstep data share the same resnet architecture shown
in Fig. 8. The input footstep representations affect the dimensions of the first
convolutional (conv.) layer of the resnet model, which takes as input a stride footstep
tensor of shape (n, m, c), where n x m is the 2D footstep sensor matrix and c is the
number of frames: c = 1 for the spatial case and c = 100 for the temporal component.
The filter size of the resnet blocks (Fig. 9) and the channels change according to the
input footstep tensor dimensions.

Fig. 10 Spatial raw (top) and spatial processed (bottom) footstep representations of
user 1. a Sample 1, b sample 2. Top representation dimension is 13 x 14 pixels. Bottom
representation dimension is 88 x 88 pixels

Fig. 11 Spatial raw (top) and spatial processed (bottom) footstep representations of
user 2. a Sample 1, b sample 2. Top representation dimension is 13 x 14 pixels. Bottom
representation dimension is 88 x 88 pixels

The widely-used deep network design introduced by the VGG net [68] is adopted 
for the resnet models. The methodology decreases the spatial component at the conv. 
layers as a function of increasing the number of filter maps, from the left (input) to 
the right (output) layers of the network. 


5.5 Verification System Evaluation


The verification system performance was evaluated by using the Detection error 
trade-off (DET) curve [69], which displays a trade-off of missed detection and false 
alarm errors. We also used the Equal Error Rate (EER) to summarise the biometric 
verification performance of the system. The EER is the intersection in the DET curve 
where the False Rejection Rate (FRR) and the False Acceptance Rate (FAR) are equal. 
Therefore, we are giving equal importance to FRR and FAR for the evaluation of our 
experiments. 
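
As a hedged illustration of how an EER can be estimated from verification scores, the
sketch below scans every observed score as a decision threshold and reports the point
where FAR and FRR are closest. The function names and the score lists are hypothetical
and are not taken from the SFootBD experiments; the sketch assumes that a higher score
means the sample is more likely to belong to the client.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor scores accepted; FRR: fraction of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate the EER as the mean of FAR and FRR where their gap is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

genuine = [0.9, 0.8, 0.75, 0.6, 0.4]    # hypothetical client scores
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]    # hypothetical impostor scores
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

For these hypothetical scores, one genuine sample is rejected and one impostor sample is
accepted at a threshold of 0.6, so FAR and FRR are both 20%. With few samples the two
rates may never be exactly equal, which is why the sketch averages them at the closest
point.
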


5.6 Results 


5.6.1 Airport Scenario: Benchmark B1 


For this benchmark, the fusion of the spatial and temporal domains performs best 
overall for the 3 representations considered. Separately, the raw representation deliv- 
ered the best performance (11.80%, 11.50%) EER followed by the processed SVM 
representation with (8%, 12.50%) EER and lastly, the processed representation 
obtained (10.10%, 14.50%) EER. 

For the fusion of representations, the raw and processed representations deliver
(8.10%, 10.70%) EER, which is better than considering the two representations
separately. The combination of the raw, processed and processed SVM representations
delivers the optimal performance overall, (7.10%, 10.50%) EER. This improves the
previously reported optimal performance [29] by 2% EER in the evaluation and 0.9% EER
in the validation datasets. This benchmark considers the least amount of footstep data
for training of the 3 benchmarks and exemplifies a real-world security application,
where data is scarce.
5.6.2 Workplace Scenario: Benchmark B2


Spatio-temporal fusion performs best overall for the 3 representations in this dataset. 
The processed SVM delivers the best performance from the 3 representations with 
(3.80%, 6.70%) EER. The raw and processed SVM representation delivers the same 
evaluation performance of 8% EER. In validation, the raw representation obtains 
6.10% EER while the processed SVM delivers better performance of 3.80% EER. 

For the fusion of representations, the raw and processed representations deliver
(3.20%, 5.30%) EER, which is better than any of the representations considered
separately. The combination of the raw, processed and processed SVM representations
delivers the optimal performance overall of (2.80%, 4.90%) EER for this dataset.
This improves the previously reported optimal performance [29] by 1.8% EER in the
evaluation and 1% EER in the validation datasets.

This benchmark considers a medium amount of footstep data for training of
the 3 benchmarks. An office security environment exemplifies a real-world scenario.


5.7 Home Scenario: Benchmark B3 


Spatio-temporal fusion performs best overall for the 3 representations. The pro- 
cessed representation delivers the best performance with (1.80%, 2.60%) EER. The 
processed SVM follows with (2.10%, 3.20%) EER and lastly, the raw representation 
obtained (1.70%, 5.60%) EER. 

At the fusion of representations level, the raw and processed representations
deliver (0.80%, 2.10%) EER, performing better than considering the representations
separately, as in the previous benchmarks. The combination of the raw, processed and
processed SVM representations delivers the optimal performance, (0.70%, 1.70%) EER,
for this dataset and overall in all experiments. This improves the previously
reported optimal performance [29] by 2.3% in the evaluation and 1.4% in the
validation datasets.

This benchmark considers the largest amount of footstep data for training of
the 3 benchmarks. We argue that the best performance over all experiments is observed
here because the largest amount of footstep data is considered for training the Resnet
models. A home environment exemplifies a real-world security application of this
dataset, where the proposed methodology and models would work optimally.


5.8 Discussion 


The partition of the test datasets into validation and evaluation subsets allows
evaluation of the model's generalisation performance with high confidence, since the
evaluation dataset never influences the training process directly (training set) or
indirectly (validation set). Overall, the validation dataset EER is better than the
evaluation dataset EER, due to the generalisation of the model in held-out footstep
data. We are able to provide better performance results in all benchmarks when
compared with previously reported work [29].

Table 9 Biometric verification results in terms of EER (in %) for benchmarks B1
(40 clients), B2 (15 clients) and B3 (5 clients)

                                                    Benchmark B1     Benchmark B2     Benchmark B3
Representations                    Model            Val.    Eval.    Val.    Eval.    Val.    Eval.
Fusion of representations
Raw and processed                  Resnet           8.10    10.70    3.20    5.30     0.80    2.10
Raw, processed and processed SVM   Resnet and SVM   7.10    10.50    2.80    4.90     0.70    1.70

The validation dataset performance influences the early stopping procedure at the 
training time of the resnet models, thus indirectly influencing the generalization per- 
formance of the system. However, this is a widely used procedure, and by providing 
an EER performance in a held-out dataset (evaluation) a closer and more realistic 
estimate of the generalization performance is provided. 

Deep residual networks are known to show state-of-the-art performance for
problems that use large amounts of data for model training, such as ImageNet
[13, 67], which contains millions of samples for training. This effect can be seen
in both the validation and evaluation dataset performance results shown in Table 9,
where performance improves as the data available per model increases.

The raw and processed resnet representations obtained very similar EER
performance in the 3 datasets, as observed in Table 9. Therefore, the raw models are
able to provide competitive performance from raw unprocessed footstep data evaluated
in a learning model when compared with processed footstep data.

This section has explored the important effects of testing spatio-temporal input
footstep data representations in machine learning models based on deep residual
networks. The representations are based on footstep raw and processed data. We
compare their performance with a processed representations approach using an SVM.
The two methods delivered similar performance. The critical factors that affect
footstep biometric verification performance are the spatio-temporal data
representations considered and the amount of data considered for training.

Three datasets from the largest footstep database were considered for the
spatio-temporal analysis. The datasets resemble data-driven real-world scenarios,
including a small footstep dataset for security applications (Benchmark B1), a
medium-size dataset for office-oriented applications (Benchmark B2), and a large
dataset for home-based scenarios (Benchmark B3). These scenarios are intended to cover
the most common real-world scenarios.

The experiments performed here have shown that there is not a single optimal
representation for all datasets. Considering the representations separately, for
Benchmark B1 the raw representation performs optimally, in Benchmark B2 the processed
SVM delivers optimal verification performance and for Benchmark B3 the processed
representation performs best overall. This justifies this research in terms of
evaluating several representations in machine learning models in order to obtain
a robust footstep recognition model.

This result highlights the need for raw data representation analysis for automatic
feature learning models. We have demonstrated that an ensemble of resnet and SVM
models using processed and unprocessed footstep data yields a robust footstep
recognition model for biometric verification.


6 Conclusions 


In this chapter, spatio-temporal gait and footstep representations have been studied
with deep learning methodologies. In the healthcare theme, dual-task has been
classified with robust classification performance by providing an F-score of 97.33% in
the optimal case, while in the security theme, state-of-the-art footstep recognition
performance has been obtained in a biometric verification scenario, obtaining an
optimal EER of 0.7%. Therefore, robust pattern recognition in gait and footstep
analysis has been provided with high statistical significance. The methodologies used
to obtain the optimal results applied deep machine learning principles based on
convolutional neural networks.

In the healthcare theme, the link between cognitive activities and their effects
on the changes in human gait patterns was investigated. The research analyzed
the effect of cognitive activities on gait patterns from healthy individuals. The
methodology delivered results with a cohort of 69 participants performing dual-task
experiments. In the optimal case scenario, an F-score of 97% was obtained to
identify dual-task patterns. The methodology clearly outperformed optimized
classical machine learning models (non-deep learning) and was able to distinguish the
gender of participants with an optimal F-score of 97.3%.

In the security theme, state-of-the-art footstep recognition performance results
were obtained in challenging biometric verification scenarios. The largest footstep
database to date, the SFootBD, was used to validate the deep machine learning
methodology. The database has almost 20,000 footstep signals from 127 users. First,
the spatial footstep domain was studied in a single real-world biometric verification
setting, then the spatio-temporal footstep domain was studied extensively in
three critical real-world biometric verification scenarios: home, office and airport.
The optimal results demonstrated that an ensemble of processed and raw footstep data
and a combination of shallow and deep machine learning models delivered
state-of-the-art recognition performance for biometric verification. The
methodology delivered a 0.7% EER in the optimal biometric verification case.

The methodology presented in the healthcare theme may be potentially applied 
to studies of large cohorts of users in the MCI stage or with the pathology of AD 
[57, 70]. This research direction could further investigate the link between changes 
in gait and neurodegenerative disease progression at early stages. The deep learn- 
ing methodologies presented here can be applied in a multi-sensor spatio-temporal 
environment to search for further behavioral signatures that may flag AD in the 
prodromal stage. This approach could include other sensing modalities, for example
by integrating keyboard use patterns of individuals in daily computer use [71] or
speech recognition analysis [72]. Furthermore, accelerometer sensors, cameras or 
indoor/outdoor tracking systems [70] could also be integrated for robust analysis of 
early stages of AD. 
Min Wu and Xiaoli Li


Abstract Building energy efficiency has gained more and more attention in the last few
years. Occupancy level is a key factor for achieving building energy efficiency, as it
directly affects energy-related control systems in buildings. Among the varieties of
sensors for occupancy estimation, environmental sensors have the unique properties of
being non-intrusive and low-cost. In general, occupancy estimation using environmental
sensors contains feature engineering and learning. Traditional feature extraction
requires manually extracting significant features without any guidelines. This
handcrafted feature extraction process requires strong domain knowledge and will
inevitably miss useful and implicit features. To solve these problems, this chapter
presents a Convolutional Deep Bi-directional Long Short-Term Memory (CDBLSTM) method
that consists of a convolutional neural network with a stacked architecture to
automatically learn local sequential features from raw environmental sensor data from
scratch. Then, the LSTM network is used to encode temporal dependencies of these
local features, and the bi-directional structure is employed to consider the past and
future contexts simultaneously during feature learning. We conduct real experiments
to compare the CDBLSTM with some state-of-the-art approaches for building occupancy
estimation. The results indicate that the CDBLSTM approach outperforms all
the state-of-the-art approaches.


Keywords Deep learning · Building occupancy estimation · Environmental
sensors · CDBLSTM


1 Introduction 


To maintain the thermal comfort of indoor environments, around 40% of energy
is consumed in the building sector [28]. Thus, a lot of attention has been paid to
building energy efficiency and sustainable development. To achieve that, a crucial
factor is the building occupancy information, also known as the occupant number or
range in buildings. It can be used for building climate and adaptive light control
[28, 36]. Balaji et al. saved 17.8% of energy for HVAC systems by relying on actual
occupancy levels in a designed experiment [1]. A light control system developed in
[24] reported a reduction of 35-75% in energy consumption for building light
control systems. However, obtaining an accurate and robust occupancy estimation
system is a challenging task that remains unsolved.

Occupancy estimation can be done using different sensors. For instance, Liu
et al. presented a method for detecting the absence and presence of occupants via PIR
sensors [27]. It is more meaningful, however, to obtain the actual occupant number or
range indoors. To that end, methods relying on RFID and wearable devices
were presented in [1, 25]. However, these approaches require users to wear specific
devices, which is intrusive and inconvenient. Accurate occupancy estimation can be
achieved by using cameras [42]. However, camera-based solutions often suffer from
insufficient illumination and high computational load. Besides, they
also raise privacy concerns. Some other methodologies rely on occupants'
involvement, such as using chair sensors [23] and appliance power usage data [22].
However, occupants who are not involved cannot be detected.

Recently, environmental sensors have been widely adopted for occupancy estimation,
because they are low-cost and non-intrusive for users [21, 29, 40, 41]. Due to the
complex relationship between environmental sensor measurements and occupancy
levels, physical modeling has limited performance. An alternative is to model
the complex relationship by using machine learning techniques, which work well
on function approximation. Since environmental sensor data are noisy and not
representative of different occupancy levels, machine learning models trained with
raw sensory data may have limited performance. The common practice is to perform
feature engineering, which intends to extract more informative representations for
different occupancy levels [26]. However, traditional manual feature engineering
does not have a guideline on which features should be extracted for occupancy
inference. In addition, it requires strong domain knowledge and will inevitably miss
implicit and useful features. To solve this problem, this chapter presents a
Convolutional Deep Bi-directional Long Short-Term Memory (CDBLSTM) that consists of a
convolutional neural network with a stacked structure to learn useful representations
(features) automatically from scratch [11]. The convolutional network is able to
learn sequential local features from raw environmental sensor data. Since
environmental sensory data is a typical time series, temporal dependencies are of
great importance for accurate and robust occupancy inference. To model the temporal
dependencies in data, we adopt a BLSTM network whose inputs are the sequential local
features learned by the convolutional neural network. We have compared the CDBLSTM
approach with some state-of-the-art approaches in the existing literature using real
experiments.


2 Literature Review 


Many advanced algorithms have been presented for occupancy inference in buildings
using environmental sensor data. The authors in [13] presented an occupancy
estimation system for an open office room by using sensor networks that are able to
collect data of CO2, CO, acoustics, PM2.5, motion, illumination, temperature and
humidity. Some statistical features, e.g., the 20-min moving average and the 1st order
difference, were manually extracted. Next, the most important features were chosen
via the popular information gain theory. Finally, data-driven methods including
Support Vector Machine (SVM), Artificial Neural Network (ANN) and Hidden Markov
Model (HMM) were utilized for occupancy estimation. They concluded that
the most significant sensors are CO2 and acoustic, and that the HMM achieves the best
performance for occupancy estimation.

The authors in [30] employed environmental sensors of temperature, CO2,
humidity, and pressure to estimate occupancy for a tutorial room. They extracted
features similar to those used in [13]. An ELM-based wrapper algorithm was developed
for feature selection and occupancy inference.

In [38], the authors investigated various sensors including sound, motion,
temperature, door state, CO2, humidity, passive infrared and light to infer occupancy
in both multi-occupant and single-occupant offices via some widely used machine
learning algorithms. Instead of extracting more useful features, they used raw sensor
data as features. Here, the authors applied many informative sensors to guarantee
a satisfactory performance of their proposed method. The contribution of different
sensors (features) was tested by using the theory of information gain. Eventually,
light level, door state and CO2 were shown to be the most important parameters. Among
the different algorithms, the decision tree (DT) approach had the best performance.

Candanedo et al. developed an occupancy detection system with sensors of
humidity, CO2, temperature and light levels [3]. They also used the raw sensor data as
features in this work, and utilized some statistical models to identify the two states
of absence and presence of occupants. Different combinations of features with distinct
statistical approaches were tried, so that the best sensors and models could be
selected. Finally, they concluded that satisfactory performance can be achieved when
sensors and learning methods are properly selected.

Since occupancy dynamics have the Markov property [4, 7, 8], the HMM
has achieved great success for building occupancy detection and estimation [13].
However, the traditional HMM often suffers from some limitations, such as the use of
a mixture of Gaussians model to estimate emission probabilities and the fixed
transition probability matrix. To solve these issues, the authors in [12] presented an
IHMM-MLR for environmental sensor based occupancy inference. Firstly, inhomogeneous
transition probability matrices for capturing occupancy dynamics at distinct time
steps were developed. Then, multinomial logistic regression was designed to produce
the emission probabilities with environmental sensor data. Two schemes, i.e., online
and offline, were formulated to infer occupancy in distinct situations.

Chen et al. presented another system to enhance the performance of occupancy
estimation by considering occupancy properties [6]. They performed a fusion of
traditional machine learning algorithms with a well-developed occupancy model which
is able to capture occupancy properties. The sensors they utilized include CO2,
humidity, pressure and temperature, which are widely available. The algorithms include
ELM, SVM, ANN, KNN, CART and LDA. They formulated a Bayes filter to fuse
the occupancy model and the six data-driven algorithms for the estimation of
occupancy. A detailed survey of occupancy estimation can be found in [5].

Here, we leverage environmental sensors including temperature, CO2, pressure
and humidity that are common in normal HVAC systems [14], instead of applying
specific sensors, such as acoustic level [13, 38], motion [19, 38] and light level [3].
Rather than using the noisy sensor data as features or using handcrafted statistical
features, we attempt to automatically extract useful local sequential features
by using the convolutional neural network with a stacked structure. Then, the BLSTM
network encodes temporal dependencies of the sequential local features during
high-level feature learning. We have made a comprehensive comparison with some
state-of-the-art approaches using actual experiments.


3 Methodology 


We first give an overview of the CDBLSTM for environmental sensor
based occupancy inference. Then, we introduce the key components of the CDBLSTM,
i.e., the convolutional neural network, the DBLSTM, and the classification layers.
Finally, the training process of the CDBLSTM approach is covered.


3.1 Overview 


For environmental sensor based occupancy estimation, the key part is to learn
discriminative representations (features) from raw data for distinct occupancy levels.
Figure 1 presents the CDBLSTM framework for environmental sensor based occupancy
inference. The raw input is a sliding window of environmental sensor data. Then, a
convolutional network with multiple filters is applied to learn features of local
sliding windows, known as local feature learning, which is of great importance for
distinguishing data from different occupancy levels. Next, the DBLSTM is leveraged
to encode temporal dependencies of local sequential features in the forward and
backward directions. Finally, the learned high-level features from the DBLSTM are fed
into fully connected and softmax layers for the classification of different occupancy
levels.

Fig. 1 Framework of the CDBLSTM approach [11]


3.2 Convolutional Operation 


We apply a convolutional neural network to environmental sensor data to produce
sequential local features. Generally, it contains a convolutional layer together
with a pooling layer. Figure 2 shows the convolutional and pooling operations on
environmental sensor data. The convolutional operation uses a sliding window over the
raw time-series data to get sequential local features. The pooling operation then
reduces the feature dimension of the sequential local features. The detailed
implementation of the two operations is presented below.


Convolutional Layer: Suppose that the $n$ input samples are $\{X_i\}$, $i = 1, 2, \ldots, n$,
and each input sample $X_i \in \mathbb{R}^{r \times d}$ is a sliding window of
environmental sensor data, where $r$ is the length of the sequence and $d$ is the number
of sensors. It can also be represented as $X_i = [x_1, \ldots, x_r]$. The convolution
operation multiplies a filter vector $v \in \mathbb{R}^{md \times 1}$ with a slice of the
input $x_{i:i+m-1} \in \mathbb{R}^{md \times 1}$, which is defined as follows

$$x_{i:i+m-1} = x_i \oplus x_{i+1} \oplus \cdots \oplus x_{i+m-1} \tag{1}$$

where $m$ denotes the window size and $\oplus$ is the concatenation operation. Next, an
activation function is applied to the multiplied result, shown as

$$c_i = g(v^T x_{i:i+m-1} + b) \tag{2}$$

where $g(\cdot)$ is the activation function, $b$ is the bias term and $T$ is the transpose
operation. The widely used ReLU activation function [31] is adopted. By sliding the
filter from the beginning of the input sequence to its end, we can produce a feature
map, shown as follows:

$$c^j = [c_1, c_2, \ldots, c_{r-m+1}] \tag{3}$$

where $j = 1, 2, \ldots, k$, and $k$ is the number of filters.

Pooling Layer: The pooling operation reduces the feature dimension, leading to more
discriminative features [15]. In this work, we adopt the widely used max-pooling,
which takes the maximum over $s$ consecutive components of the feature map $c^j$. After
the pooling operation, the features will be

$$z^j = [z_1, z_2, \ldots, z_{\frac{r-m}{s}+1}] \tag{4}$$

where $z_i = \max(c_{is-s}, c_{is-s+1}, \ldots, c_{is-1})$. Hence, the pooling operation
will generate a compressed feature map $z^j$, $j \in 1, 2, \ldots, k$. Eventually, the
output of the convolutional neural network will have a feature dimension of
$(\frac{r-m}{s}+1) \times k$.

In general, assuming the number of samples is $n$, the input data has a dimension
of $n \times r \times d$. The output of the convolutional neural network has a size of
$n \times (\frac{r-m}{s}+1) \times k$. It can be seen that the length of the input data is
compressed from $r$ to $(\frac{r-m}{s}+1)$. In addition, the data dimension changes from
$d$ (number of sensors) to $k$ (number of filters), where $k$ is much larger than $d$. This
means that the data becomes more informative. In other words, the convolutional neural
network can be treated as a local feature learner which is able to get more informative
representations and preserve the temporal information from raw environmental sensor
data.
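
The convolution and pooling steps of this section can be sketched for a single filter
as follows. This is a minimal illustration with hypothetical values (the window `X`, the
filter `v` and the sizes r = 6, d = 2, m = 2, s = 2 are not taken from the experiments).
As an assumption, a shorter final pooling group is kept, so that the pooled length
equals (r - m)/s + 1 when s divides r - m.

```python
def relu(a):
    """ReLU activation g(.)."""
    return max(0.0, a)

def conv_feature_map(X, v, b, m):
    """Slide a filter v of length m*d over X (r rows of d sensor values).

    Each window concatenates m consecutive rows (Eq. 1) and is mapped to
    g(v^T x + b) (Eq. 2); the results form one feature map (Eq. 3).
    """
    feature_map = []
    for i in range(len(X) - m + 1):
        window = [value for row in X[i:i + m] for value in row]
        feature_map.append(relu(sum(vi * xi for vi, xi in zip(v, window)) + b))
    return feature_map

def max_pool(c, s):
    """Maximum over consecutive groups of s components (Eq. 4).

    A shorter last group is kept (assumption for illustration).
    """
    return [max(c[i:i + s]) for i in range(0, len(c), s)]

# Hypothetical sliding window: r = 6 time steps, d = 2 sensors.
X = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0], [4.0, 0.0], [0.0, 2.0]]
v = [1.0, -1.0, 0.5, 0.5]            # one filter of length m * d = 4
c = conv_feature_map(X, v, b=0.0, m=2)
z = max_pool(c, s=2)
print(c, z)
```

With k filters, repeating the two calls once per filter would give k pooled feature
maps, i.e. an output of size ((r - m)/s + 1) x k for one input window.
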





3.3 Deep Bi-directional LSTM 


Recurrent Neural Networks (RNNs) are widely used for modeling time series data
thanks to their strong sequential modeling capacity. However, the conventional RNN
often has the problem of vanishing or exploding gradients during training. This
dramatically influences the performance of RNNs on modeling long-term dependencies
in time-series data [2]. To solve this issue, the authors in [17] proposed a new
architecture, named LSTM, which uses gates to control whether information is
preserved or discarded, such that it is able to capture long-term dependencies
of the sequence. The LSTM network has been successfully employed in a number of
important and challenging tasks, e.g., activity recognition [9, 10] and natural
language processing [34]. The conventional LSTM only considers the sequential
information in one direction, that is, the forward direction. This is not adequate for
sequential modeling of environmental sensor data, since the future information may
also be useful. To consider both the future and past contexts for occupancy inference,
we adopt the BLSTM, which contains a forward layer and a backward layer to process
sequential data in the forward and backward directions.

Recently, deep structures have achieved great success in representation learning
[16]. The Deep Bi-directional LSTM (DBLSTM), which stacks multiple BLSTM
layers, is adopted in this study to encode the temporal dependencies and learn
high-level features from the sequential local features extracted by the convolutional
neural network. In addition, the DBLSTM lets the inputs propagate
through time and space (layers) simultaneously, so that the model parameters are
distributed over layers instead of enlarging the memory size of the network. This
results in a more efficient non-linear operation on the data, which is also the ultimate
purpose of stacking multiple layers in deep learning [16]. Figure 3 illustrates a
hidden layer $l$ at time steps $t-1$, $t$ and $t+1$ of the DBLSTM network, where the
arrows pointing to the left and right denote the backward and forward operations,
respectively. Here, the forward operation from time step $t-1$ to $t$ captures the
past information, and the backward operation from time step $t+1$ to $t$ models
the future information. We use one hidden layer $l$ at time step $t$ as an example to
show the detailed operation of the DBLSTM network. Assume that $h_{t-1}^l$ is the hidden
state, $C_{t-1}^l$ is the memory cell state, $w_f^l$, $w_i^l$, $w_c^l$ and $w_o^l$ are the weights,
$b_f^l$, $b_i^l$, $b_c^l$ and $b_o^l$ are the biases, and $\sigma(\cdot)$ denotes the sigmoid activation
function. The forward process shown as $\rightarrow$ and the backward process shown as
$\leftarrow$ can be formulated as follows:

Fig. 3 Structure of DBLSTM

The final output of the $l$-th hidden layer at time $t$ of the DBLSTM network is a
concatenation of the forward and backward layers, which can be expressed as

$$h_t^l = \overrightarrow{h}_t^l \oplus \overleftarrow{h}_t^l \qquad (7)$$

where $\overrightarrow{h}_t^l$ updates the current hidden state by using the past information, that is,
the time from 1 to $t-1$, and $\overleftarrow{h}_t^l$ updates the current hidden state by using the
future information, that is, the time from $t+1$ to $r$.
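
The bidirectional processing and the concatenation in Eq. (7) can be sketched in plain Python. This is only a minimal illustration: it assumes the standard LSTM cell update (forget, input, candidate and output gates) with scalar inputs and states, and the helper names `lstm_step`, `run_direction` and `blstm_layer`, the parameter layout and the numeric values are hypothetical rather than taken from the chapter.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard scalar LSTM cell (assumed formulation).

    params maps each gate name ("f", "i", "c", "o") to (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))        # forget gate
    i = sigmoid(pre("i"))        # input gate
    g = math.tanh(pre("c"))      # candidate cell state
    o = sigmoid(pre("o"))        # output gate
    c = f * c_prev + i * g       # new memory cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c


def run_direction(xs, params):
    """Run the cell over xs in the given order, starting from zero states."""
    h, c = 0.0, 0.0
    hidden = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        hidden.append(h)
    return hidden


def blstm_layer(xs, fwd_params, bwd_params):
    """Return, for every time step, the pair (forward h, backward h)."""
    forward = run_direction(xs, fwd_params)
    # The backward pass reads the sequence in reverse; its outputs are
    # reversed again so that index t lines up with the forward pass.
    backward = run_direction(xs[::-1], bwd_params)[::-1]
    return list(zip(forward, backward))


# Illustrative parameters only (not values from the chapter).
PARAMS = {"f": (0.5, 0.1, 0.0), "i": (0.4, 0.2, 0.0),
          "c": (0.9, 0.3, 0.0), "o": (0.6, 0.1, 0.0)}
outputs = blstm_layer([0.1, 0.5, -0.3], PARAMS, PARAMS)
```

Each element of `outputs` pairs a state that has only seen the past with one that has only seen the future. A deep variant would feed these pairs into the next BLSTM layer, which requires vector-valued inputs and is omitted here for brevity.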
3.4 Occupancy Inference Layers


The outputs of the DBLSTM network are high-level features, which are fed into
several fully connected layers to obtain more abstract representations. The
fully connected layers can be expressed as:

$$o^i = g(\alpha_i \mu^i + \beta_i) \qquad (8)$$

where $\mu^i$ and $o^i$ are the input and output of the $i$-th fully connected layer, respectively,
$\alpha_i$ and $\beta_i$ are the weights and bias, respectively, and $g(\cdot)$ is the activation function. We
choose ReLU as the activation function in this study. Suppose that we have stacked
$c$ fully connected layers; the output of the last fully connected layer, denoted $o^{c-1}$,
is the final representation of the input data. The final feature representations are fed
into a softmax classification layer to obtain the occupancy.
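
As a small illustration of Eq. (8) followed by the softmax layer, the sketch below passes a toy feature vector through one ReLU layer and a softmax output over four occupancy ranges. The layer sizes, weights and the class ordering (0 = zero, 1 = low, 2 = medium, 3 = high) are assumptions chosen for the example, not values from the chapter.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def dense(inputs, weights, bias, activation):
    """Compute activation(W @ inputs + b) for one fully connected layer.

    weights holds one row of input weights per output node.
    """
    pre = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, bias)]
    return activation(pre)


def softmax(v):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 3 input features -> 2 hidden nodes -> 4 occupancy ranges.
features = [0.2, -0.4, 1.0]
hidden = dense(features, [[0.5, 0.1, 0.3], [-0.2, 0.4, 0.1]], [0.0, 0.2], relu)
probs = dense(hidden, [[0.3, -0.1], [0.2, 0.2], [-0.4, 0.5], [0.1, 0.1]],
              [0.0, 0.0, 0.0, 0.0], softmax)
predicted_range = probs.index(max(probs))  # assumed order: zero, low, medium, high
```

The predicted occupancy range is simply the index of the largest softmax probability.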


3.5 Training Process of the CDBLSTM 


With the outputs of the CDBLSTM and the true labels (occupancy ranges), the errors
can be calculated over all the training data, and the error gradients are then derived
and back-propagated to adjust the model parameters for the training of the CDBLSTM
[37]. More precisely, given training data with the true occupancy levels, the network
outputs are calculated first. Then, the cross-entropy loss is derived from
the network outputs and the true occupancy levels. Next, the error gradients are
back-propagated to adjust the model parameters via a gradient-based
optimization algorithm. In this study, we adopt the popular optimization method of
RMSprop [35]. Precisely, given $\theta_t$, the parameter to be optimized, and $L(\theta_t)$, the loss
function, the update to $\theta_{t+1}$ with RMSprop can be calculated as:

$$\vartheta_{t+1} = \gamma \vartheta_t + (1 - \gamma) \nabla L(\theta_t)^2 \qquad (9)$$

$$\theta_{t+1} = \theta_t - \frac{\eta \nabla L(\theta_t)}{\sqrt{\vartheta_{t+1} + \epsilon}} \qquad (10)$$

where $\vartheta_t$ is a moving average of the squared gradient at time step $t$, and the learning
rate $\eta$, the parameter $\gamma$ and the decaying rate $\epsilon$ are chosen to be 0.001, 0.9 and 0,
respectively.

In order to alleviate overfitting, we use the technique of dropout. With
dropout, parts of the hidden nodes are randomly masked with probability $p$
during training. Figure 4 illustrates the operation of dropout. During model training,
a thinned architecture is preserved and trained at each iteration. Given a network
containing $n$ nodes with a dropout probability $p$ of 0.5, the network
can be treated as an ensemble of $2^n$ thinned networks. Due to the shared structure
of these thinned networks, the number of parameters remains the same. During
testing, dropout is switched off and all the network nodes contribute to the
model outputs, which is similar to an ensemble of distinct thinned networks. In
other words, dropout can be viewed as a way to enlarge the training data. In each training
iteration, random masking also introduces variation into the data, which makes the trained
network more robust. The dropout technique has been shown to be effective in
preventing overfitting [33]. Therefore, in this study, we place one dropout
layer between the DBLSTM and the first fully connected layer and another dropout
layer between the two fully connected layers, where the masking probabilities are
chosen to be 0.5 and 0.3, respectively.

Fig. 4 The operation of dropout. Left: the network without dropout; Right: the network after
dropout. Crossed nodes have been dropped during model training [33]
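
The RMSprop update in Eqs. (9)-(10) and the dropout masking described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a small `eps` of 1e-8 is used instead of 0 to avoid division by zero, and kept activations are rescaled by 1 / (1 - p) ("inverted dropout"), which the chapter does not specify.

```python
import math
import random


def rmsprop_step(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update over lists of parameters, following Eqs. (9)-(10).

    eps=1e-8 is an assumption made here to avoid division by zero.
    """
    new_avg = [gamma * v + (1.0 - gamma) * g * g for v, g in zip(avg_sq, grad)]
    new_theta = [p - lr * g / math.sqrt(v + eps)
                 for p, g, v in zip(theta, grad, new_avg)]
    return new_theta, new_avg


def dropout(activations, p, rng, training=True):
    """Mask each activation with probability p during training.

    Kept activations are scaled by 1 / (1 - p) (inverted dropout, an
    assumption) so that nothing needs rescaling when dropout is off.
    """
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Toy usage: minimise L(theta) = theta^2, whose gradient is 2 * theta.
theta, avg_sq = [1.0], [0.0]
for _ in range(100):
    grad = [2.0 * t for t in theta]
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)

rng = random.Random(0)
masked = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
```

Dividing by the root of the running average makes the step size roughly independent of the gradient scale, which is why the toy loop moves `theta` by about `lr` per iteration once the average has warmed up.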


4 Evaluation Results 


In this section, we first introduce the data acquisition process. Then, the evaluation
setup and experimental results are presented. After that, the generalization performance
of the CDBLSTM is analyzed by randomly selecting the data for training and
testing. Finally, to further demonstrate the performance of the CDBLSTM for building
occupancy inference using environmental sensors, we present additional results
of the CDBLSTM using data collected from another environment, i.e., a tutorial room.
4.1 Data Collection


The sensor data of CO2, temperature, air pressure and humidity have been collected
from a research lab on a university campus. The lab has an office area which contains
24 cubicles and 11 open seats. Generally, nine postgraduate students and eleven
research staff work in the office area. In addition, the lab has six PCs for
undergraduate students working on their final year projects and five PCs for other students.
It is well known that identifying the exact occupancy (number of occupants) is very
challenging and may require high-cost sensors in a crowded space. Here, instead of
estimating the exact occupancy, we divide the occupancy into four ranges: zero,
low, medium and high. These occupancy ranges are sufficient for common building
control and scheduling systems [18]. To make the four ranges balanced, which will
maximize the impact of state changes, we define low occupancy as 1 to 6 subjects,
medium occupancy as 7 to 14 subjects, and high occupancy as more than 14
subjects.
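
The mapping from an occupant count to the four ranges can be written as a small helper. The function name `occupancy_range` and the string labels are illustrative choices; the thresholds are those stated above.

```python
def occupancy_range(count):
    """Map an occupant count to one of the four occupancy ranges."""
    if count < 0:
        raise ValueError("occupant count cannot be negative")
    if count == 0:
        return "zero"
    if count <= 6:
        return "low"       # 1 to 6 subjects
    if count <= 14:
        return "medium"    # 7 to 14 subjects
    return "high"          # more than 14 subjects


labels = [occupancy_range(n) for n in (0, 3, 9, 20)]
```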

We measure the pressure level with a Lutron MHB-382SD sensor, and
CO2, temperature and relative humidity with the CL11 sensor from Rotronic. The
sampling frequency is one sample per minute for both sensors. During data collection,
the data were first stored in the sensors' internal memory and then transferred to a PC
over a USB cable. Note that the area is air-conditioned by conventional
Variable Air Volume and Active Chilled Beam systems, and is ventilated by an Air
Handling Unit (AHU) that constantly provides fresh air.

Table 1 shows the accuracy and resolution of the sensors. During the experiments,
we attach the sensors to supports at a height of 1.1 m above the ground. Figure 5
illustrates the layout of the space, which has a size of 20 m × 9.3 m × 2.6 m. We
deploy two pairs of sensors in this space. Here, the placements of the sensors are
selected intuitively, considering occupant density. To get the ground truth occupancy, we
deploy three IP cameras at the doors to record occupant movements. Then, the true
occupancy is counted manually with the help of motion detection software, which
takes pictures when occupants move. The entire space has three doors.
The main door (location of camera 1) connects the space with the office area for
administrative staff. Another door, located at camera 2 in Fig. 5, opens to a lab
space. The third door is always closed. Note that all windows are closed due to
the operation of the air-conditioning and ventilation systems.

In total, we collected 31 workdays of data, where the first 26 days
are used for model training and the remaining 5 days are used for model testing.
Since building control systems respond slowly, a resolution of 15 min is
sufficient for occupancy estimation [39]. Because the original sensor data and occupancy
have a resolution of 1 min, we first convert them to a 15-min resolution by
simple averaging. Note that the number of occupants is an integer, so
a rounding operation is applied after averaging the original occupancy.
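
The conversion from 1-min to 15-min resolution can be sketched as block averaging. The helper below is illustrative: the name `downsample`, dropping a trailing incomplete block, and rounding half up are assumptions, since the chapter only states that averaging is followed by rounding for the occupancy.

```python
def downsample(values, block=15, as_count=False):
    """Average consecutive blocks of 1-min samples into one value per block.

    A trailing incomplete block is dropped (an assumption). With
    as_count=True the mean of non-negative counts is rounded half up.
    """
    out = []
    for start in range(0, len(values) - block + 1, block):
        mean = sum(values[start:start + block]) / block
        out.append(int(mean + 0.5) if as_count else mean)
    return out


# 30 minutes of CO2 readings and occupant counts at 1-min resolution.
co2 = [400.0] * 15 + [430.0] * 15
counts = [4] * 15 + [5] * 8 + [6] * 7
co2_15 = downsample(co2)
occupancy_15 = downsample(counts, as_count=True)
```

Python's built-in `round` rounds halves to the nearest even integer, so `int(mean + 0.5)` is used here to get the more familiar half-up behaviour for non-negative counts.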
Table 1 The accuracy and resolution of the sensors


Sensor             Environmental parameter   Resolution   Accuracy
Rotronic CL11      CO2                                    ±5% of the measured value
                   Temperature                            ±0.3 K
                   Relative humidity                      <2.5% RH
Lutron MHB-382SD   Pressure                               ±2 hPa
Fig. 5 Layout of the research lab


4.2 Evaluation Setup


To evaluate the performance of the CDBLSTM, we compare it with
several state-of-the-art methods: the HMM approach with
information gain based feature selection of statistical handcrafted features
(Dong’s method) [13], the DT with raw data as features (Yang’s method) [38], the
ELM with wrapper based feature selection of statistical handcrafted features
(Masood’s method) [30], and the LDA with raw data as features (Candanedo’s
method) [3].

The DBLSTM without the convolutional network for local sequential feature
extraction is also implemented for comparison. Since we choose a resolution of
15 min and the sampling period of the sensors is 1 min, the length of the input
sequence r is 15. With the 2 pairs of sensors shown in Fig. 5, each pair providing four
measurements, the total number of sensor inputs d is 8. Hence, the input has a
dimension of 15 × 8 for environmental sensor based
occupancy estimation. We use cross-validation on the training data to choose proper
hyperparameters for all the approaches. Specifically, the DBLSTM consists of three
BLSTM layers with 24, 75 and 100 hidden nodes, followed by two fully connected
layers with 150 and 100 hidden nodes. For the CDBLSTM approach,
the window size, the pooling size and the number of filters are chosen to be 3, 2 and
100, respectively. The CDBLSTM contains three BLSTM layers with hidden sizes
of 100, 150 and 200. The two fully connected layers have 200 and 300 hidden
nodes. The deep algorithms, i.e., CDBLSTM and DBLSTM,
are implemented in Keras. The shallow algorithms are implemented in Matlab.
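
To make the stated sizes concrete, the sketch below traces the output shape of each CDBLSTM stage for a 15 × 8 input. It is not the authors' Keras code. Several details are assumptions labeled in the comments: the convolution uses no padding and stride 1, pooling is non-overlapping, only the last BLSTM layer returns a single vector, and the softmax has four outputs (one per occupancy range). The function name `cdblstm_shapes` is hypothetical.

```python
def cdblstm_shapes(seq_len=15, n_inputs=8, window=3, pool=2, filters=100,
                   blstm_sizes=(100, 150, 200), fc_sizes=(200, 300),
                   n_classes=4):
    """Return (layer name, output shape) pairs for the described CDBLSTM."""
    shapes = [("input", (seq_len, n_inputs))]
    steps = seq_len - window + 1      # assumed: no padding, stride 1
    shapes.append(("conv", (steps, filters)))
    steps //= pool                    # assumed: non-overlapping pooling
    shapes.append(("pool", (steps, filters)))
    for idx, size in enumerate(blstm_sizes, start=1):
        last = idx == len(blstm_sizes)
        # Forward and backward states are concatenated, hence 2 * size.
        # Assumed: only the last BLSTM layer collapses the time axis.
        shape = (2 * size,) if last else (steps, 2 * size)
        shapes.append((f"blstm{idx}", shape))
    for idx, size in enumerate(fc_sizes, start=1):
        shapes.append((f"fc{idx}", (size,)))
    shapes.append(("softmax", (n_classes,)))
    return shapes


for name, shape in cdblstm_shapes():
    print(name, shape)
```

Under these assumptions the convolution shortens the sequence from 15 to 13 steps and pooling halves it to 6, while the feature dimension grows from 8 to 100.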

Here, occupancy estimation is regarded as a typical classification problem. Hence,
classification accuracy can be adopted as a criterion for model performance
evaluation. In addition, we use another widely used evaluation criterion, the Normalized
Root Mean Square Error (NRMSE), which shows the range of classification errors
[38]. Since absence and presence are of great significance for building
control systems, especially lighting control systems [32], the detection accuracy of
these two states is also analyzed.
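
The three criteria can be computed as in the sketch below, with occupancy ranges encoded as integers 0 (zero) to 3 (high). The encoding and the function names are illustrative. In particular, normalizing the RMSE by the label range (3) is an assumption, since the chapter does not state which normalization it uses.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted range equals the true range."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def nrmse(y_true, y_pred, n_levels=4):
    """RMSE of the range indices divided by the label range (assumed)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (n_levels - 1)


def presence_accuracy(y_true, y_pred):
    """Accuracy of the binary presence/absence decision (range > 0)."""
    return accuracy([t > 0 for t in y_true], [p > 0 for p in y_pred])


y_true = [0, 1, 2, 3, 0]
y_pred = [0, 2, 2, 1, 1]
scores = (accuracy(y_true, y_pred),
          nrmse(y_true, y_pred),
          presence_accuracy(y_true, y_pred))
```

Unlike plain accuracy, the NRMSE penalizes a prediction more the further it lies from the true range, so confusing "low" with "high" costs more than confusing "low" with "medium".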


4.3 Evaluation Results 


The evaluation results for the different methods under the three defined evaluation
criteria are shown in Table 2. Candanedo’s and Yang’s approaches, which use the
raw data as features, perform the worst. Note that Candanedo et al. [3] and Yang
et al. [38] used many sensors in their works to guarantee satisfactory performance,
which is not practical due to the high cost and the inconvenience caused by constant
maintenance. Masood’s and Dong’s approaches perform better than Candanedo’s
and Yang’s approaches, due to the use of statistical features instead of raw data
as features. These results clearly show that feature extraction is necessary and
useful, especially with limited sensors. Since Masood’s and Dong’s methods use
manually extracted features, which will inevitably miss useful and implicit features,
the performance of these methods is also limited for environmental sensor based
occupancy estimation.

Table 2 The evaluation results of different methods under the three evaluation criteria. P/A
represents Presence/Absence

Criterion                      Dong’s [13]  Yang’s [38]  Masood’s [30]  Candanedo’s [3]  DBLSTM  CDBLSTM
Classification accuracy (%)    71.46        66.67        72.31          70.21            74.38   76.04
NRMSE                          0.1912       0.2509       0.2322         0.2297           0.1574  0.1169
Detection accuracy of P/A (%)  93.13        90.21        92.38          88.54            ??.21   95.42

Owing to its deep structure for feature learning and temporal encoding, the
DBLSTM approach performs better than all the state-of-the-art methods under
these three evaluation criteria. With the powerful local feature extractor provided
by the convolutional network, the CDBLSTM further enhances the performance of the
DBLSTM. It outperforms all the other approaches, with an occupancy estimation
accuracy, NRMSE and detection accuracy of 76.04%, 0.1169 and 95.42%,
respectively.

We also illustrate the occupancy estimation results of all the testing days in Fig. 6,
from which several insights can be drawn:


- Candanedo’s and Yang’s approaches perform worse than the other approaches, due to
the use of raw data as features. With sensor noise and a limited number of sensors,
the raw sensor data are not representative of the different occupancy levels. A more
effective way is to extract representative features.

- Since Masood’s method exhaustively searches for the best combination of features with
the proposed wrapper method, it tends to overfit, which degrades its results on the
testing data. Similarly, Dong’s method cannot track occupancy profiles well with the
handcrafted features. It can be concluded that handcrafted features lack a clear
guideline and will inevitably miss useful and implicit features, which limits the
system performance.

- One interesting phenomenon is that the estimated occupancy suddenly increases at
midnight for Candanedo’s, Masood’s and Yang’s approaches. A careful check of the data
suggests that this is caused by a sudden increase in the CO2 data. The recorded
video was then checked, and we found that one subject sitting near a pair of sensors
usually walks around to prepare for leaving at that time. The optimal placement of
sensors will be considered in our future work [20]. Due to the sequential modeling
capacity of the HMM and the BLSTM structure, Dong’s approach, DBLSTM and
CDBLSTM are almost immune to this issue caused by the increase in the CO2 data.

- With the deep structure for feature learning and the BLSTM network for temporal
encoding, the DBLSTM and CDBLSTM approaches outperform all the state-of-the-art
methods.

- Owing to the convolutional network for local feature extraction, the CDBLSTM
further enhances the performance of the DBLSTM, and its better performance over
all methodologies indicates the effectiveness of using the CDBLSTM for building
occupancy inference based on environmental sensors.


Time complexity is a major concern for deep learning based methods. To show
the time complexity of the CDBLSTM, we measured its training and testing time during
the experiments. The state-of-the-art algorithms, which are all based on manual feature
extraction and conventional machine learning algorithms, have much smaller training and
testing times than the CDBLSTM. The CDBLSTM is implemented on
a computer which has dual core CPUs of Intel Xeon(R) E5-2697 v2 2.70 GHz and
a GPU of NVIDIA Tesla K40c. Its training time is about 16 min and 40 s. Although
this amount of training time is large, it is still acceptable because the training only
needs to be done once, offline. The testing time of the CDBLSTM for all the
samples (480 samples) is 0.35 s. This is negligible for building control systems
with a resolution of 15 min. Hence, we can conclude that the CDBLSTM method
can be used for real-time occupancy estimation with environmental sensors.

Fig. 6 The evaluation results of the testing data for all the methodologies [11]
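
The sample count and the per-sample testing time quoted above can be cross-checked with a short calculation, assuming one sample per 15-min interval over the five testing days:

```latex
\begin{align*}
N_{\text{test}} &= 5~\text{days} \times \frac{24 \times 60~\text{min}}{15~\text{min}}
                 = 5 \times 96 = 480~\text{samples},\\
\frac{0.35~\text{s}}{480~\text{samples}} &\approx 0.73~\text{ms per sample}.
\end{align*}
```

A per-sample time below one millisecond is several orders of magnitude shorter than the 15-min (900 s) update interval.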


4.4 Hyperparameters


Some hyperparameters are crucial for the CDBLSTM approach. Here, the
masking probabilities of the two dropout layers and the number of hidden
layers are investigated. We explored three masking probability levels: high
(0.7), medium (0.5) and low (0.3). Figure 7 shows the occupancy estimation
accuracy of the CDBLSTM with different combinations of masking probabilities. We
find that the CDBLSTM may underfit, with degraded performance, when high
masking probabilities such as the combinations [0.7 0.7], [0.7 0.5], [0.5 0.7]
and [0.5 0.5] are used. Clearly, a good selection of this hyperparameter
enhances the performance of the CDBLSTM. The number of hidden layers is another
key hyperparameter of the model. The estimation performance of the model with
different numbers of hidden layers is shown in Fig. 8. When the number of hidden
layers increases from 1 to 3, the model performance improves. However, if the number of
hidden layers is larger than 4 in this study, the model may overfit, resulting in limited
performance.
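
The combinations of masking probabilities explored for the two dropout layers form a small grid, which can be enumerated as below. The names `LEVELS` and `dropout_grid` are illustrative. Training and scoring a model for each pair would require a separate routine that is not shown here.

```python
from itertools import product

# The three masking probability levels named in the text.
LEVELS = {"high": 0.7, "medium": 0.5, "low": 0.3}


def dropout_grid(levels=LEVELS):
    """All (first layer, second layer) masking probability pairs."""
    return list(product(levels.values(), repeat=2))


grid = dropout_grid()
```

With three levels per layer the grid contains nine pairs, including the setting of 0.5 and 0.3 adopted in this study.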
4.5 The Impact of Noise


The CDBLSTM approach is almost immune to some abnormal and noisy
data, as analyzed in Sect. 4.3, due to its ability to consider temporal dependencies in
the data. In order to explore the robustness of the CDBLSTM to noisy data, we manually
add noise to the raw sensor data. Figure 9 presents the performance of all
the approaches at different noise levels. Note that the signal-to-noise ratio (SNR)
is ∞ when no noise is added. When the SNR decreases (noisier data), the performance of
all the approaches degrades accordingly. Due to their capability of modeling temporal
dependencies in data, the impact of noise on the HMM model (Dong’s), DBLSTM
and CDBLSTM is smaller, which is consistent with the previous conclusion. The
evaluation shows that the CDBLSTM approach is robust against noise in the data.
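
One common way to add noise at a prescribed SNR is sketched below. The chapter does not state the noise model, so zero-mean Gaussian noise, the definition of signal power as the mean of squared samples, and the function names are all assumptions made for illustration.

```python
import math
import random


def power(values):
    """Mean squared value of a sequence."""
    return sum(v * v for v in values) / len(values)


def add_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that the SNR is about snr_db."""
    if math.isinf(snr_db):
        return list(signal)          # SNR of infinity means no noise
    noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy copy."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(power(clean) / power(noise))


rng = random.Random(0)
clean = [math.sin(0.01 * i) for i in range(20000)]
noisy = add_noise(clean, snr_db=10.0, rng=rng)
```

For real sensor channels with a large offset, such as CO2 in ppm, one may prefer to compute the signal power after removing the mean. That is a design choice the chapter leaves open.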


4.6 Generalization Performance 


In order to verify the generalization performance of the CDBLSTM method,
additional experiments are conducted. Specifically, we randomly select five days of data
for model testing and use the rest for training. Note that each day of data has an equal
probability of being chosen for training or testing, so that the results indicate the
generalization capability of the CDBLSTM approach. We repeated the experiment three
times. Figure 10 shows the final results. It can be seen that the DBLSTM
approach performs better than the state-of-the-art methods, and the CDBLSTM
performs the best under the three evaluation criteria. These conclusions are the same as
in the previous analysis. This clearly demonstrates the good generalization performance
of the CDBLSTM method for environmental sensor based occupancy detection and
estimation.
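
The random day-level split can be sketched as follows. The function name `split_days`, the use of day indices 0 to 30 and the seeds are illustrative assumptions. The chapter only states that five of the days are drawn at random for testing and that the experiment is repeated three times.

```python
import random


def split_days(n_days=31, n_test=5, seed=None):
    """Randomly pick n_test whole days for testing; the rest are for training."""
    rng = random.Random(seed)
    test_days = sorted(rng.sample(range(n_days), n_test))
    train_days = [d for d in range(n_days) if d not in test_days]
    return train_days, test_days


# Three repetitions with different (arbitrary) seeds.
splits = [split_days(seed=s) for s in range(3)]
```

Splitting by whole days rather than by individual 15-min samples keeps each daily occupancy profile intact in either the training or the testing set.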


4.7 Additional Evaluation with Data from Another Environment


To further evaluate the performance of the CDBLSTM, we perform an additional
experiment with data collected from a tutorial room. In total, we collected fourteen
workdays of data for evaluation, from which we randomly choose eleven days
for training and the rest for testing. A more comprehensive description of the data is
presented in [30]. The evaluation results of all the approaches are shown in Table 3. It
can be seen that all the approaches perform worse in this scenario. The reason is that
we only deployed one pair of sensors in this large environment. To enhance the
performance, more sensors should be deployed. This evaluation leads to the same
conclusion: the DBLSTM outperforms all the state-of-the-art methods, and the CDBLSTM
performs the best. This further demonstrates the effectiveness and robustness of the
CDBLSTM approach for environmental sensor based building occupancy estimation.


Table 3 Evaluation results in the tutorial room

Criterion                      Dong’s [13]  Yang’s [38]  Masood’s [30]  Candanedo’s [3]  DBLSTM  CDBLSTM
Estimation accuracy (%)        57.78        54.44        54.22          55.56            58.89   65.56
NRMSE                          0.3768       0.3201       0.3214         0.3296           0.2676  0.2383
Detection accuracy of P/A (%)  ?4.00        78.89        85.22          78.89            ??.56   87.78
5 Conclusion


This chapter introduces a deep learning algorithm, termed Convolutional Deep
Bi-directional Long Short-Term Memory (CDBLSTM), for environmental sensor based
occupancy inference in buildings. The CDBLSTM consists of a convolutional
network for sequential local feature extraction from the raw environmental sensor data
and a DBLSTM for temporal encoding and feature learning. To verify the performance
of the CDBLSTM, we perform experiments in a research lab environment and compare it
with several existing approaches and the DBLSTM method without the convolutional
operation. The results indicate that the DBLSTM outperforms the state-of-the-art methods
and the CDBLSTM has the best performance, which indicates the merits of the convolutional
network and the DBLSTM structure for temporal encoding and feature learning. We
also test some hyperparameters of the CDBLSTM and conclude that a proper
selection of model hyperparameters boosts the performance of the CDBLSTM. Then,
the impact of noise on model performance is evaluated. The results show that the
CDBLSTM is able to alleviate the effect of noise due to its structure. After that,
we test the generalization performance of the CDBLSTM by randomly selecting data
for training and testing, and obtain the same conclusion in this scenario. Finally,
we perform an additional test in a tutorial room. Similarly, the CDBLSTM achieves
superior performance over all the other methodologies.